PREFACE 


The first edition of this book occupied for several years the considerable 
gap between the multitudinous textbooks of elementary descriptive statistics 
requiring little mathematical background and the ever-growing technical and 
often highly mathematical literature of statistics scattered throughout many 
Journals. 

Since 1939, however, numerous books have appeared dealing at different 
levels with the mathematical side of statistics. For the serious graduate 
student with adequate preparation there are such works as S. S. Wilks's 
Mathematical Statistics, H. Cramér's Mathematical Methods of Statistics, and 
the encyclopedic two-volume treatise by M. G. Kendall, The Advanced Theory 
of Statistics. The increasing interest in the teaching of statistics in 
universities and colleges has brought forth also several books at the undergraduate 
level. These presuppose some mathematical competence on the part of the 
reader but not the degree of maturity required by the specialist treatises 
mentioned above. 

This new second edition is an enlargement and revision of the first and is 
intended to provide as much of the mathematical foundation for the cus- 
tomary procedures of experimental statistics as an undergraduate student can 
reasonably be expected to assimilate. It presupposes the equivalent of a second 
course in Calculus (usually called Advanced Calculus), and preferably a 
course in elementary statistics corresponding approximately to Part I of 
Kenney's Mathematics of Statistics (1954, Van Nostrand). Special functions 
such as the Gamma and Beta functions, and special methods, such as the 
use of the Jacobian in multivariate problems involving change of variable, 
are explained in the text. The elements of matrix algebra are included in 
Chapter X because of the great elegance and conciseness of matrix notation 
in many problems of least squares and multivariate analysis. 

Although the book is thus definitely mathematical, and not for those who 
merely want a set of cook-book recipes for the practical problems of statistics, 
an attempt has been made to balance the experimental and mathematical 
sides of the subject. Numerous worked examples illustrate the application 
of the theory to concrete problems, while the theory serves to emphasize the 
practical limitations of the mathematical models on which many of the com- 
monly used statistical techniques are based. 

In a field as extensive as that of statistics it is inevitable that a single vol- 
ume, even a fairly bulky one, cannot adequately cover the whole range. 
Some topics, undoubtedly of importance, must be omitted or passed over 
lightly, and the choice of topics to be presented is largely a matter of personal 
taste, for which there is no accounting. In this book the main concern is 
with those problems of estimation and inference that are common to all 
branches of science using experimental or observational data, and in particu- 
lar with the attempt to estimate the characteristics of a population from those 
of a sample and to assess the reliability of such estimates. Throughout the 
book, emphasis is placed on the fallibility of statistics, on the sampling 
variance and standard error of a statistic, and on the idea of a confidence interval 
within which, with a specified degree of confidence, one may claim that the 
true value of some estimated quantity will lie. 

Since the concept of probability is fundamental to statistical inference, the 
first chapter is concerned mainly with the elements of the calculus of 
probability. No attempt is made to give a rigorous foundation, which would 
require the mathematical subtleties of set theory and measure theory, but the 
basic formulas are developed from the simple classical definition of proba- 
bility, which is adequate for many problems. 

The first five chapters, dealing mainly with theoretical distributions, are 
in a sense introductory to the remaining chapters, which treat statistics proper, 
and in particular the exact distributions of sample statistics calculated 
from small samples. In most cases the distributions are worked out in detail, 
but occasionally, as for example for the distribution of the correlation 
coefficient in a sample from a bivariate normal population with a non-zero 
correlation, the complete derivation is too long to be given in full and some steps 
are merely indicated. The techniques of analysis of variance and covariance 
are discussed, at least for the more commonly occurring experimental situa- 
tions; experimental design, however, is dealt with only very briefly. Multi- 
variate analysis has been for the most part omitted because of its mathematical 
difficulty, but some discussion is given of multiple regression and of multiple 
and partial correlation. Some of the more theoretical aspects of statistical 
inference, the method of maximum likelihood, and the testing of hypotheses 
are given in Chapter XII, although these ideas and methods have been used 
quite freely in earlier chapters. 

In this book the older definition of sample variance, $\sum_{i=1}^{N} (X_i - \bar{X})^2 / N$, has 
been used, instead of the definition $\sum_{i=1}^{N} (X_i - \bar{X})^2 / (N - 1)$, which is favored 
by many writers. This has been done deliberately, after due consideration, 
in the belief that the advantages outweigh the disadvantages. It is true that 
the sample variance is an unbiased estimate of the population variance with 
the second definition, but not with the first. However, the necessity of a 
factor $N/(N - 1)$ is not important to the mathematical statistician, 
whatever it may be to the experimentalist, and similar, but different, factors to 
remove bias occur in other connections. There is a serious pedagogical 
objection to $N - 1$. Since the notion of variance usually precedes the concept 
of sampling, $N - 1$ seems quite unnatural to students and the reason for 
using it is difficult to explain. Moreover, when students come to calculating 
the variance for a theoretical distribution such as the binomial, they use $N$ 
(or its equivalent) instead of $N - 1$. This is likely to seem arbitrary and 
puzzling to them if variance has previously been defined with $N - 1$. It 
seems to be true that most mathematical statisticians (including Cramér and 
Kendall, but not Wilks) prefer the definition with $N$, whereas most 
experimentalists, with the high authority of R. A. Fisher, prefer to use $N - 1$. 
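
The two definitions of sample variance differ only by the factor N/(N - 1), which can be seen numerically. The following is a minimal sketch in Python, not part of the original text; the function names and the sample values are illustrative assumptions.

```python
def variance_n(xs):
    """Sample variance with divisor N (the older definition used in this book)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_n_minus_1(xs):
    """Sample variance with divisor N - 1 (the unbiased definition)."""
    n = len(xs)
    # The two definitions differ exactly by the factor N / (N - 1).
    return variance_n(xs) * n / (n - 1)


# Made-up data, chosen only so that the arithmetic is easy to follow.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("divisor N:    ", variance_n(sample))
print("divisor N - 1:", variance_n_minus_1(sample))
```

For this sample the mean is 5 and the sum of squared deviations is 32, so the two values are 32/8 and 32/7.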

In the arrangement of the book we have, as a rule, preferred to deal with 
purely mathematical topics as they naturally arise, instead of segregating 
them all in a preliminary chapter at the beginning. However logical, this 
latter practice is likely to be discouraging to those students whose interest 
is mainly in the applications. For this reason, also, we have not collected 
together all the “non-central” distributions which are referred to in the book 
but have dealt with them as they occur. 

The number of problems in the first edition has been considerably increased. 
Some of these are mathematical exercises and some are numerical problems 
illustrating the application of the theory to actual or fictitious experimental 
data. Where it has seemed advisable, hints for the solution are given, and 
answers are supplied to most of the numerical problems. The tables given 
in the Appendix suffice for the commoner statistical tests, but the student 
should, if possible, have access to more complete tables such as those of Fisher 
and Yates (Statistical Tables for Biological, Agricultural and Medical Research, 
3rd edition, 1949, Oliver and Boyd) or Karl Pearson (Tables for Statisticians 
and Biometricians, Part I, 2nd edition, 1924, University College, London). 
These, and other special tables, are frequently referred to in the text. Prac- 
tice in the use of a computing machine is advisable, particularly in the more 
elaborate calculations which occur in the analysis of variance, partial and 
multiple correlation, and the solution of sets of linear equations. 

It is hardly possible for the authors to express their manifold indebtedness 
to all those writers from whose books, papers, and lectures they have derived 
help and inspiration. A list of references is given at the end of each chapter, but 
this is in no sense intended as a bibliography of the subject. It serves mainly 
to suggest a few books and papers for supplementary reading and to indicate 
the source of further details for many of the proofs and discussions given in 
the text. 

In particular the authors are indebted to the publisher's readers who ex- 
amined the book in manuscript and whose criticisms and suggestions helped 
greatly in giving it a final polishing. 

February, 1951 


E. S. K. 
J. F. K. 
FOREWORD 


The following sections are more difficult mathematically, or are less impor- 
tant in the general scheme of this book, and may be omitted on a first reading: 

§§ 1.7, 2.15, 2.16, 4.12, 4.16, 4.17, 5.4, 5.12, 5.15, 6.20, 6.21, 6.22, 7.7, 7.8, 
7.9, 7.18, 8.4, 8.5, 8.13, 8.14, 9.7, 9.8, 9.10, 9.12, 9.15, 9.17, 9.18, 10.11, 10.12, 
10.16, 10.17, 11.1, 11.2, 11.7, 11.8, 11.9, 11.16, 11.18, 11.19, 11.21, 12.11, 
12.15, 12.16, 12.23. 

Chapters I to V (Probability and Distribution Theory) could be covered 
substantially in one quarter with three lecture periods a week; Chapters VI 
to IX (Sampling Statistics and Analysis of Variance) in a second quarter; 
and Chapters X to XII (Least Squares, Multivariate Problems, and Statis- 
tical Inference) in a third. For a two-semester course, the major part of 
Chapters I to VII would, in the first semester, give a grounding in the basic 
ideas of small-sample theory. The second semester might then deal with 
special techniques and methods in statistical theory, as covered in Chapters 
VIII to XII. 




CHAPTER I 


PROBABILITY 

1.1 Importance. The theory of probability is one of the most interesting 
branches of modern mathematics and is becoming conspicuous for its 
applications in many fields of learning. This subject is of fundamental importance, 
not only in the theory of insurance and statistics, but also in various branches 
of the biological and physical sciences. The following quotations from 
contemporary writers indicate the importance of probability theory in the 
philosophy of modern science. 

It was, I think, Huxley who said that six monkeys, set to strum unintelligently 
on typewriters for millions of millions of years, would be bound in time to write all 
the books in the British Museum. If we examined the last page which a particular 
monkey had typed, and found that it had chanced, in its blind strumming, to type 
a Shakespeare sonnet, we should rightly regard the occurrence as a remarkable ac- 
cident, but if we looked through all the millions of pages the monkeys had turned 
off in untold millions of years, we might be sure of finding a Shakespeare sonnet some- 
where amongst them, the product of the blind play of chance. . . .* 

These and other considerations have led many physicists to suppose that there 
is no determinism in events in which atoms and electrons are involved singly, and 
that the apparent determinism in large-scale events is only of a statistical nature. 
When we are dealing with atoms and electrons in crowds, the mathematical law of 
averages imposes the determinism which physical laws fail to provide. ... We can 
only speak in terms of probabilities. 

— The Mysterious Universe, Sir James Jeans. 

In order to understand the nature of knowledge about social and economic life, 
it is necessary to know something about the theory of probability; because knowledge 
in these fields, in general, is essentially indeterminate knowledge. There are two 
fundamental ideas which need to be grasped in order to understand the social sciences. 
The first idea is that all science is philosophical. . . . The time honored aim of 
philosophy has been to discover and interpret (to the extent possible to the human mind) 
the characteristics of nature. By nature is meant all things, material and psychic, 
external to man, and man himself. In many fields the minds of men have penetrated 
into the mysteries of nature and have produced knowledge concerning them. In the 
physical aspects (both external to man and in man) great progress has been made 
towards the attainment of apparently precise knowledge, within certain definite 

* This illustration is picturesque, but it may be noted that the time the monkeys might 
be expected to take to produce one sonnet by chance would be vastly greater than the whole 
estimated time that our solar system has existed, or indeed than the time the earth is likely 
to remain a possible abode of life. (See Problem 21 at the end of the chapter.) 

limits; while in the field of the psychic the progress has been towards increasing the 
probabilities of truth of a great variety of hypotheses. But it is characteristic of the 
psychic aspects of knowledge that the facts in those fields are indeterminate, not 
precise, and apparently dynamic. Even in the physical and chemical world, the 
discoveries of recent years have emphasized a great realm of indeterminacy, particularly 
when confronting great velocities and infinitely small particles within the atom. 
Thus the second idea to grasp is that in all fields of knowledge, even the physical, 
beyond the limited range of relatively precise knowledge accumulated by man, there 
is a vast frontier of speculation. It has been the function of scientific method — the 
new tool of philosophy — to penetrate ever deeper into this realm of speculative 
knowledge. Primarily this has been made possible by the development of the theory 
of probabilities. 

— Elementary Statistics, James G. Smith. 

There exist in nature systems of chance causes which operate in such a way that 
the effects of these causes can be predicted — by making use of customary probability 
theory in which objective probabilities in the limiting statistical sense are substituted 
for the mathematical probabilities. 

— Economic Control of Quality of Manufactured Product, W. A. Shewhart. 

It appears likely that the further development of the theory of probability in the 
next few decades may turn out to be a major chapter in the history of science. 

— Science, January 18, 1929. 

The great extension in the use of statistics in the last two decades has been as- 
sociated with and largely made possible by mathematical developments based upon 
the theory of probability. 

— Harold Hotelling, Journal American Statistical Association, March Supplement, 1931 

1.2 The Classical Definition of Probability. The notion of probability 
plays a basic role in statistical theory, yet it is not easy to define in a 
satisfactory way. We shall consider briefly some of the attempts at definition. 

The classical definition, as given by Laplace, is as follows: If an event can 
take place in n mutually exclusive ways, all equally likely, and if r of these 
correspond to what may be called success, then the probability of success 
in a single trial is r/n. Thus, if a coin is equally likely to fall “head” or 
“tail,” the probability of “head” is 1/2. 

This definition is applicable to many problems suggested by games of chance, 
provided that certain assumptions are made. We assume, for instance, that 
dice are homogeneous and unbiased, or that a deck of cards is perfectly 
shuffled. It was, in fact, in order to deal with such problems that 
mathematical probability was originally developed in the latter half of the 
seventeenth century. 
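
The classical definition amounts to counting: list the n equally likely cases, count the r that are successes, and form r/n. A minimal sketch in Python follows; it is not part of the original text, and the helper name `classical_probability` and the two-dice example are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def classical_probability(outcomes, is_success):
    """Return r/n, where n counts all equally likely cases and r the successes."""
    cases = list(outcomes)
    r = sum(1 for case in cases if is_success(case))
    return Fraction(r, len(cases))


# A coin assumed equally likely to fall "head" or "tail".
p_head = classical_probability(["head", "tail"], lambda c: c == "head")

# Two distinguishable, unbiased dice: 36 equally likely cases.
two_dice = product(range(1, 7), repeat=2)
p_sum_is_seven = classical_probability(two_dice, lambda c: sum(c) == 7)

print(p_head, p_sum_is_seven)
```

Exact fractions are used so that the ratio r/n is not blurred by rounding.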

The definition has, however, obvious deficiencies. It is not clear how one 
is to know whether the various ways are “equally likely,” nor how one is to 
deal with problems where they are known not to be equally likely. There is, 
for example, a definite probability of throwing 6 with a given biased die, 
although it may not be 1/6. Furthermore, there is a class of problems in which 
the number of ways cannot be enumerated because they are infinite. We may 
ask, for example, what is the probability that a point selected at random on a 
circular target should lie in the bull's-eye, and we can give a numerical answer, 
once an unambiguous procedure has been specified for making the random 
selection. For these various reasons many writers have preferred a different 
form of definition. 
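
The need for an unambiguous procedure can be illustrated by comparing two ways of selecting the point. The sketch below is not from the original text; the radii and the two selection procedures are assumptions chosen for illustration. If the point is uniform over the area of the target, the answer is the ratio of areas; if instead its distance from the center is uniform between 0 and the target radius, the answer is the ratio of radii.

```python
from fractions import Fraction


def p_bullseye_uniform_area(r, R):
    """Point uniform over the target's area: ratio of areas, (r/R)^2."""
    return Fraction(r, R) ** 2


def p_bullseye_uniform_radius(r, R):
    """Distance from the center uniform on [0, R]: ratio of radii, r/R."""
    return Fraction(r, R)


# Hypothetical target: bull's-eye of radius 1 on a target of radius 10.
print(p_bullseye_uniform_area(1, 10))    # 1/100
print(p_bullseye_uniform_radius(1, 10))  # 1/10
```

Both are legitimate answers to differently specified questions, which is why the procedure must be fixed before a numerical value has meaning.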

1.3 The Frequency Definition. The probability that a male United States 
citizen, aged 50, and able to pass an ordinary medical examination, will die 
within a year has undoubtedly a meaning, although not one that can be given 
in terms of the classical definition. Such a probability is assessed by an in- 
surance company when it fixes the premium that the man must pay to buy 
$1000 of insurance. Bases for assessing this probability are various mortality 
tables prepared from the records of large companies. In other words, the 
probability is estimated statistically, and not a priori. Such procedures have 
suggested the following definition of probability: 

If an event has occurred r times in the way described as “success,” in a 
series of n independent random trials, all made under the same essential con- 
ditions, the ratio r/n is called the relative frequency of success. The limit 
of r/n, as n tends to infinity, is the probability of success in a single trial. 

There are limitations to this definition also. How is one to know whether 
all the essential conditions have remained unchanged? In the illustration 
given above, from insurance, we know that conditions have not remained con- 
stant. Because of improvements in hygiene and advances in medical science, 
there has been for many years a steady increase in normal life expectancy. 
However, this difficulty is common to all branches of science in which it is 
customary to make replications of an experiment or observation; the scientist 
must rely on his judgment as to what factors may be expected to exert a per- 
ceptible influence on the result. Another difficulty is that, on this definition, 
we can never know an empirical probability precisely, because we can never 
make infinitely many trials. But when n is large a very good estimate of the 
probability can be made, and this is usually sufficient. 
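
The behavior of the relative frequency r/n for increasing n can be imitated with a pseudo-random number generator. This is a minimal sketch, not part of the original text; the function name, the seed, and the checkpoints are assumptions, and a pseudo-random generator only approximates the ideal of independent random trials.

```python
import random


def relative_frequencies(p, checkpoints, seed=0):
    """Simulate trials with success probability p; record r/n at each checkpoint."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wanted = set(checkpoints)
    successes = 0
    results = {}
    for n in range(1, max(checkpoints) + 1):
        if rng.random() < p:
            successes += 1
        if n in wanted:
            results[n] = successes / n
    return results


freqs = relative_frequencies(0.5, [10, 100, 1000, 10000, 100000])
for n, f in sorted(freqs.items()):
    print(n, f)
```

For large n the printed ratios should lie close to 0.5, though no finite run can establish the limit itself.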

It must be noted that the word “limit” is used in a special sense here. 
Mathematically, r/n tends to a limit p if for any given positive number ε we can 
find a number N such that for all n > N we can be sure that | r/n - p | < ε. 
That is, for large enough n, the difference between r/n and p is, and remains 
as n increases, less than ε. But in a sequence of chance events this cannot be 
guaranteed. After throwing a coin 10,000 times, it is possible, although not 
likely, that the next 1000 tosses should all turn out to be heads. 

We need, in fact, a new definition of limit appropriate to a sequence of 
random variables. It is a matter of experience that in a large number of tosses of 
a coin the relative frequency of “heads” fluctuates less and less violently as 
the number n increases, and shows a marked tendency to become practically 
constant when n is very great. Provided the conditions may be assumed to 
be uniform, the relative frequency of success in a series of random trials always 
shows this tendency to settle down to a comparatively stable value. We can, 
therefore, assert that to the random event E we may assign a number p 
(0 ≤ p ≤ 1) called the probability of E, such that in a long series of n 
repetitions it is practically certain that the relative frequency of occurrence of E, 
say r/n, will be approximately equal to p. This is expressed by saying that 
r/n converges in probability or converges stochastically to the limit p. This 
notion may be made more precise. We can assign a small positive number ε 
and say that r/n is approximately equal to p if | r/n - p | < ε, and Bernoulli's 
theorem (§ 2.12) then gives a lower limit for n corresponding to an agreed 
interpretation of “practically certain.” 
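
One way to obtain such a lower limit for n can be sketched as follows. The book's own form of Bernoulli's theorem is given in § 2.12 and is not reproduced here; this sketch assumes instead the standard Chebyshev bound P(| r/n - p | ≥ ε) ≤ p(1 - p)/(nε²), and interprets “practically certain” as a failure probability of at most δ. The function name and the symbol δ are assumptions introduced for the illustration.

```python
import math
from fractions import Fraction


def trials_needed(epsilon, delta, p=None):
    """Smallest n for which p(1 - p) / (n * epsilon^2) <= delta.

    If p is unknown, the worst case p(1 - p) = 1/4 is used.
    Arguments should be Fractions (or integers) so the arithmetic is exact.
    """
    eps = Fraction(epsilon)
    d = Fraction(delta)
    spread = Fraction(1, 4) if p is None else Fraction(p) * (1 - Fraction(p))
    return math.ceil(spread / (d * eps ** 2))


# r/n within 0.01 of p, except with probability at most 0.05.
print(trials_needed(Fraction(1, 100), Fraction(1, 20)))
# The same requirement when p is known to be 1/6.
print(trials_needed(Fraction(1, 100), Fraction(1, 20), p=Fraction(1, 6)))
```

The Chebyshev bound is crude, so the values of n it gives are sufficient but far from the smallest possible.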

The word "random" was used previously in defining probability. It is a 
familiar term in statistics, but one that is not easy to define. Roughly, a 
sequence of trials is random if the results follow no recognizable pattern. In 
the unending sequence of digits that represents 1/7 as a decimal fraction, 
namely, .142857142857142857 ..., the relative frequency of the digit 1 
approaches 1/6 as a limit, but the sequence is obviously not random. A 
sequence of 0's and 1's, such as 011010111001010001 ... may be described as 
random if when we pick out any subsequence by some pre-assigned rule, the 
limiting frequency of 0's in the subsequence is the same as it is in the original 
sequence. We may, for example, pick out every second term, or every term 
following a zero, or make any other rule we please, provided, of course, that 
the decision as to whether a particular term is picked does not depend on that 
term. We could not, obviously, decide to pick all the zeros. 
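
The decimal expansion of 1/7 shows how a sequence can have a limiting frequency and still fail this test of randomness. The sketch below is not part of the original text; the function names and the two selection rules are illustrative assumptions. Both rules decide whether to pick a term without looking at that term, yet they change the frequency of the digit 1 from 1/6 to 1.

```python
def digits_of_one_seventh(count):
    """First `count` decimal digits of 1/7, obtained by long division."""
    digits, remainder = [], 1
    for _ in range(count):
        remainder *= 10
        digits.append(remainder // 7)
        remainder %= 7
    return digits


def relative_frequency(seq, value):
    return seq.count(value) / len(seq)


digits = digits_of_one_seventh(6000)

# Frequency of the digit 1 in the whole sequence.
overall = relative_frequency(digits, 1)

# Rule 1: pick every sixth term, starting with the first.
every_sixth = digits[0::6]

# Rule 2: pick every term that follows a 7.
after_a_seven = [d for prev, d in zip(digits, digits[1:]) if prev == 7]

print(overall, relative_frequency(every_sixth, 1), relative_frequency(after_a_seven, 1))
```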

This definition of randomness is due to von Mises. There has been 
considerable discussion as to whether a truly random sequence, known as a 
collective, actually exists. It seems, however, that, given any distribution, it is 
possible to define a collective in which the relative frequencies tend to the 
limits prescribed by the distribution. The laws of probability relate to such 
a collective which, as von Mises points out, is an ideal thing like the points 
and lines of geometry. Actual pencil or chalk lines are only approximations 
to the ideal lines about which we reason, and in a similar way actually observed 
sequences of events are more or less crude approximations to ideal collectives. 
How far the laws of probability apply to events in the real world can be 
determined only by trial and observation. 

1.4 Other Definitions of Probability. A fundamentally different approach 
to the subject is that of J. M. Keynes and Harold Jeffreys, to whom 
probability is a measure of the degree of rational belief in a proposition. In this 
sense one can speak of the probability, for example, that Bacon wrote the plays 
commonly attributed to Shakespeare, although such a probability is 
meaningless on either the classical or the frequency definition and should rather, as 
Hotelling suggests, be called “credibility.” By making certain more or less 
arbitrary assumptions, it is possible to calculate the probabilities of compound 
events, but we shall not pursue this matter further. 

A more sophisticated approach to the definition of probability than any of 
those discussed above is the axiomatic one. The method consists in laying 
down a number of axioms and proceeding to deduce the theorems required. 
The axioms, of course, are not completely arbitrary, since we wish to deduce 
theorems that are approximately verified by experience, but, once they are 
laid down, the rest of the theory follows with logical rigor. An example of 
such a treatment is to be found in Kolmogoroff's little book Foundations of the 
Theory of Probability. This is also the method used by Cramér in his valuable 
work Mathematical Methods of Statistics, where the treatment is based on the 
theory of sets of points and of the “measure” of a set. At the level of this 
book it will not be possible to enter into the details of a rigorous treatment, 
but in § 1.7 we give an indication of the axiomatic approach. 

1.5 Some Theorems of Algebra. 

Theorem 1.1. The number of permutations of n distinguishable things taken 
r at a time is denoted by the symbol P(n, r) and given by 

(1.1) $P(n, r) = n(n - 1)(n - 2) \cdots (n - r + 1) = \dfrac{n!}{(n - r)!}$

where the symbol n! (“n factorial”) is defined by 

(1.2) $n! = n(n - 1)(n - 2) \cdots 3 \cdot 2 \cdot 1, \quad n = 1, 2, 3, \ldots$

and 0! = 1. 


Theorem 1.2. If n things consist of $n_1$ all alike of type $T_1$, $n_2$ all alike of 
type $T_2$, ..., $n_k$ all alike of type $T_k$, so that $\sum_{i=1}^{k} n_i = n$, the number of 
permutations of all n things is 

$\dfrac{n!}{n_1! \, n_2! \cdots n_k!}$

This follows since the $n_i!$ permutations of the $n_i$ things among themselves 
are indistinguishable. 

Theorem 1.3. The number of combinations of n different things taken r at a 
time is denoted by C(n, r), or more commonly in recent times by $\binom{n}{r}$, which may 
be read “n above r,” and is given by 

(1.3) $\binom{n}{r} = \dfrac{n!}{r! \, (n - r)!}$

Since 0! = 1, $\binom{n}{0} = 1$. We define $\binom{n}{r}$ as equal to 0 for all r > n. 

Theorem 1.4. The total number of combinations of n distinguishable things 
taken 1, or 2, ..., or n at a time is $2^n - 1$. 

Note that $\binom{n}{r}$ is the coefficient of the (r + 1)th term in the binomial 
expansion of $(x + y)^n$, that is, 

(1.4) $(x + y)^n = \sum_{r=0}^{n} \binom{n}{r} x^{n-r} y^r$

If we let x = y = 1, this becomes 

(1.5) $2^n = 1 + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n}$
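
Theorems 1.1 to 1.4 can be checked for small cases by listing the arrangements directly and comparing the counts with the formulas. This is a minimal sketch, not part of the original text; the values n = 5, r = 3 and the word BANANA are arbitrary choices made for illustration.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

n, r = 5, 3
things = range(n)

# Theorem 1.1: permutations of n things taken r at a time.
count_perm = sum(1 for _ in permutations(things, r))

# Theorem 1.2: distinct arrangements of BANANA (3 A's, 2 N's, 1 B).
distinct_arrangements = len(set(permutations("BANANA")))
formula_1_2 = factorial(6) // (factorial(3) * factorial(2) * factorial(1))

# Theorem 1.3: combinations of n things taken r at a time.
count_comb = sum(1 for _ in combinations(things, r))

# Theorem 1.4: combinations taken 1, or 2, ..., or n at a time.
total_nonempty = sum(comb(n, k) for k in range(1, n + 1))

print(count_perm, perm(n, r))
print(distinct_arrangements, formula_1_2)
print(count_comb, comb(n, r))
print(total_nonempty, 2 ** n - 1)
```

Each printed pair should agree, the first number coming from enumeration and the second from the formula.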

Theorem 1.5. The number of ways in which n distinguishable things can be 
divided into k classes, $n_1$ in the first class, $n_2$ in the second, and so on, is 

$\dfrac{n!}{n_1! \, n_2! \cdots n_k!}$

where $\sum n_i = n$. This number is the coefficient of $x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$ in the 
multinomial expansion of $(x_1 + x_2 + \cdots + x_k)^n$. 

Theorem 1.6. The number of ways of putting m distinguishable objects into 
n exactly similar compartments (n ≥ m), not more than one in each compartment, 
is n!/(n - m)!. This follows since n - m of the compartments, being empty, 
are indistinguishable among themselves. 

If the objects are not distinguishable, the number is 

$\dfrac{n!}{(n - m)! \, m!} = \binom{n}{m}$

since the m occupied compartments cannot now be distinguished among themselves. 
Rearrangements of the order of occupied and empty compartments count as 
different ways. 

Theorem 1.7. The number of ways of putting m distinguishable objects into 
n ordered compartments, when any number from 0 to m may go in each 
compartment, is $n^m$. 

This follows since each object may go into n different compartments, irre- 
spective of where the others go. 

Theorem 1.8. The number of ways of putting m indistinguishable objects into 
n ordered compartments, any number in each compartment, is 

$\binom{n + m - 1}{m}$

If the objects are indistinguishable the number of ways required is the 
number of ways of separating the m objects, supposed set out in a row, by n - 1 
partitions. These partitions in effect divide the objects among n 
compartments, labeled, say, from left to right. The number of ways is that of 
arranging n + m - 1 things, that is, m objects and n - 1 partitions, m all alike of 
one kind and n - 1 all alike of another kind, and hence is 

$\dfrac{(n + m - 1)!}{m! \, (n - 1)!} = \binom{n + m - 1}{m}$
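
The occupancy counts of Theorems 1.6 to 1.8 can be verified by enumeration for small m and n. The following sketch is not part of the original text; the values n = 4, m = 2 and the representation of an arrangement as a tuple of compartment labels are assumptions made for illustration.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, factorial

n, m = 4, 2  # n compartments, m objects
compartments = range(n)

# Theorem 1.6: distinguishable objects, not more than one per compartment.
# Each arrangement assigns a different compartment to each object.
ways_1_6 = sum(1 for _ in permutations(compartments, m))

# Theorem 1.6, second part: indistinguishable objects, at most one per
# compartment, so only the set of occupied compartments matters.
ways_1_6_alike = sum(1 for _ in combinations(compartments, m))

# Theorem 1.7: distinguishable objects, any number per compartment.
ways_1_7 = sum(1 for _ in product(compartments, repeat=m))

# Theorem 1.8: indistinguishable objects, any number per compartment.
# Only the number of objects in each compartment matters.
ways_1_8 = sum(1 for _ in combinations_with_replacement(compartments, m))

print(ways_1_6, factorial(n) // factorial(n - m))
print(ways_1_6_alike, comb(n, m))
print(ways_1_7, n ** m)
print(ways_1_8, comb(n + m - 1, m))
```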


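Theorem 1.9 (Binomial Theorem for a negative integral index). If n is a 
positive integer, 

(1.6) $(1 + x)^{-n} = 1 - nx + \dfrac{n(n + 1)}{2} x^2 - \dfrac{n(n + 1)(n + 2)}{2 \cdot 3} x^3 + \cdots$

for values of |x| < 1. This may be written 

(1.7) $(1 + x)^{-n} = \sum_{i=0}^{\infty} (-1)^i \binom{n + i - 1}{i} x^i$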
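
The expansion can be checked numerically by summing a finite number of terms. This is a minimal sketch, not part of the original text; it assumes the general term of the series is (-1)^i C(n + i - 1, i) x^i, which agrees with the first few terms of (1.6), and the values of x, n, and the number of terms are arbitrary.

```python
from math import comb


def negative_binomial_series(x, n, terms=200):
    """Partial sum of the series for (1 + x)^(-n), valid for |x| < 1."""
    return sum((-1) ** i * comb(n + i - 1, i) * x ** i for i in range(terms))


x, n = 0.3, 4
approx = negative_binomial_series(x, n)
exact = (1 + x) ** (-n)
print(approx, exact)

# The coefficients of x, x^2, x^3 match those written out in (1.6).
coefficients = [(-1) ** i * comb(n + i - 1, i) for i in range(4)]
print(coefficients)
```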


1.6 Some Fundamental Theorems of Probability. In a set of random trials 
there may be various possible outcomes of a single trial. These will be called 
events. An event may be simple or compound. 

A simple event can be represented by some value or values of a single random 
variable x, where x is a real number. In some cases, x can take only a set of 
discrete values; in other cases it may range over a finite or infinite interval of 
the real axis. Thus, the results of successive tosses of a coin may be 
represented by a sequence of 0's and 1's, so that x can take only these two values. 
If x is a measurement of height for a group of a hundred grown men, it may 
range anywhere from, say, 60 to 80 inches, but is quite unlikely to be outside 
this interval. An “event” may, for instance, be a measured height in the 
interval 70 to 71 inches. 

A compound event requires two or more random variables for its 
representation. Thus, if two dice are thrown at the same time, the real number x (with 
discrete values 1, 2, ..., 6) corresponds to one die and the real number y (with 
the same set of discrete values) to the other die. The compound event 
consisting of the simultaneous observation of the two dice, which are supposed 
to be distinguishable, may be represented by a point of rectangular coordinates 
(x, y). These points form a square lattice of 36 points in the xy-plane. If 
x represents height and y weight for a group of men, both x and y may range 
over finite intervals, and the compound event consisting of a measurement of 
height and a measurement of weight for one particular individual is represented 
by a point of coordinates (x, y) lying in a certain region of the xy-plane. The 
extension to more than two variables is obvious. 

We shall presently establish certain basic theorems of probability with the 
help of the classical definition (§ 1.2). It will be assumed that the phrase 
“equally likely” has a well-understood sense. 

It will be convenient to use the following notation: 
p(A) = probability that the simple event A happens, 
$p(\bar{A})$ = probability that A does not happen (often called the probability 
of not-A), 
p(AB) = probability that both A and B happen on the same trial 
(probability of the compound event A and B), 
p(A + B) = probability that either A or B happens or that both happen 
(probability of the compound event A or B), 
p(A | B) = probability that A happens when it is known that B happens 
(conditional probability of A, given B). 

If A and B are mutually exclusive, they cannot both happen on the same 
trial, so that p(AB) = 0. Thus A might be the event of throwing 6 and B 
the event of throwing 1 with a single die. But if A is the event of throwing a 
number greater than 4 and B the event of throwing an even number, both 
events could happen in a single trial. Any event which cannot happen has 
probability 0, and any event which always happens has probability 1. It may 
be noted that on the frequency definition of probability the converses of 
these statements are not true. An event has probability 1 if the limiting 
relative frequency of its occurrence is 1, and this does not imply that it always 
happens. It might, for instance, fail once in the first ten trials and never fail 
again. Similarly an event has probability 0 if the limiting relative frequency 
of its occurrence is 0, which does not mean that it never happens. 

Theorem 1.10. If A and B are mutually exclusive, 

(1.8) p(A + B) = p(A) + p(B) 

This is called the addition theorem for probabilities. 

Suppose that, out of n mutually exclusive and equally likely cases, a 
correspond to the event A and b to the event B. If these n are all the possible cases, 
p(A) = a/n, and p(B) = b/n. Also, by definition, p(A + B) = (a + b)/n, 
since there are a + b cases which correspond to either A or B. The theorem 
follows at once. 

Theorem 1.11. If $A_1, A_2, \ldots, A_n$ are mutually exclusive events, 

(1.9) $p(A_1 + A_2 + \cdots + A_n) = p(A_1) + p(A_2) + \cdots + p(A_n)$

The proof is an obvious extension of that just given. 

Theorem 1.12. $p(A) + p(\bar{A}) = 1$. For by Theorem 1.10 the left-hand 
side is equal to $p(A + \bar{A}) = 1$, since A must either happen or not happen. 

Theorem 1.13. 

(1.10) $p(A + B) = p(A) + p(B) - p(AB)$

Since A must occur in combination either with B or not-B, we have 

$p(A) = p(AB) + p(A\bar{B})$

Similarly 

$p(B) = p(AB) + p(\bar{A}B)$

By the definition of p(A + B), and Theorem 1.11, 

$p(A + B) = p(AB) + p(A\bar{B}) + p(\bar{A}B)$

whence equation (1.10) follows at once, on substituting for $p(A\bar{B})$ and $p(\bar{A}B)$. 
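
Theorem 1.13 can be checked on the die example used earlier, with A the event of throwing a number greater than 4 and B the event of throwing an even number. The sketch below is not part of the original text; representing events as sets of faces and the helper name `p` are assumptions made for illustration, and the die is taken to be unbiased.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # faces of a single unbiased die


def p(event):
    """Classical probability: favorable cases over all equally likely cases."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))


A = {5, 6}      # a number greater than 4
B = {2, 4, 6}   # an even number

lhs = p(A | B)                 # p(A + B): A or B or both
rhs = p(A) + p(B) - p(A & B)   # right-hand side of (1.10)
print(lhs, rhs)

# The decomposition used in the proof: A occurs either with B or with not-B.
not_B = OUTCOMES - B
print(p(A), p(A & B) + p(A & not_B))
```

In the code, `|` and `&` are Python's set union and intersection, corresponding to the book's A + B and AB.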


Theorem 1.14. 

(1.11) $p(A_1 + A_2 + A_3) = p(A_1) + p(A_2) + p(A_3) - p(A_1A_2) - p(A_1A_3) - p(A_2A_3) + p(A_1A_2A_3)$ 

If B stands for the event $A_2 + A_3$ (that is, the occurrence of at least one of 
the two events $A_2$ and $A_3$), 

$p(B) = p(A_2) + p(A_3) - p(A_2A_3)$ 

and 

(1.12) $p(A_1 + A_2 + A_3) = p(A_1 + B) = p(A_1) + p(B) - p(A_1B)$ 

By definition and Theorem 1.13, 

$p(A_1B) = p(A_1A_2 + A_1A_3) = p(A_1A_2) + p(A_1A_3) - p(A_1A_2A_3)$ 

Hence, by substituting for p(B) and $p(A_1B)$ in equation (1.12), we arrive at 
(1.11). This theorem may be extended by induction to any finite number of 
events. The general statement is 

(1.13) $p(A_1 + A_2 + \cdots + A_n) = \sum_{i} p(A_i) - \sum_{ij} p(A_iA_j) + \sum_{ijk} p(A_iA_jA_k) - \cdots$ 

where $\sum_{ij}$ means the sum over all pairs i, j with $i \neq j$, each pair being 
counted once, $\sum_{ijk}$ means the sum over all triples i, j, k with no two of them 
equal, each triple being counted once, and so on. 
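The general statement (1.13) can be checked by direct enumeration on a small sample space. The following Python sketch is only an illustration: the three events on a single die and the helper names `prob` and `union_by_inclusion_exclusion` are our own assumptions, not part of the text.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, space):
    """Classical probability: favorable cases over equally likely cases."""
    return Fraction(len(event), len(space))

def union_by_inclusion_exclusion(events, space):
    """Right-hand side of (1.13): alternating sum over joint occurrences."""
    total = Fraction(0)
    for size in range(1, len(events) + 1):
        sign = 1 if size % 2 == 1 else -1
        # Each group of `size` distinct events is counted exactly once.
        for group in combinations(events, size):
            total += sign * prob(set.intersection(*group), space)
    return total

# Hypothetical illustration: one throw of a single die.
space = set(range(1, 7))
A1 = {2, 4, 6}   # an even number
A2 = {5, 6}      # a number greater than 4
A3 = {1, 2, 3}   # a number less than 4

lhs = prob(A1 | A2 | A3, space)                          # direct count
rhs = union_by_inclusion_exclusion([A1, A2, A3], space)  # formula (1.13)
```

Here both `lhs` and `rhs` come out equal to 1, since the three events together cover every face of the die.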

Theorem 1.15. 

(1.14) p(AB) = p(A)p(B | A) = p(B)p(A | B) 

For, out of the total of n possible and equally likely cases, let r be favorable 
to A and s to both A and B. Then, by definition, p(A) = r/n, p(B | A) = s/r, 
p(AB) = s/n. But s/n = (s/r) · (r/n), which proves the first part of the 
theorem. The second follows on interchanging A and B and noting that 
p(AB) = p(BA). 

This theorem may also be extended to compound events with more than 
two constituents. Thus, 

(1.15) p(ABC) = p(A)p(B | A)p(C | AB) 

The events A and B are said to be independent if the probability of each is 
unaffected by whether the other happens or not. That is, A and B are 
independent if p(A | B) = p(A) and p(B | A) = p(B). Actually, one of these 
conditions entails the other, since if p(B | A) = p(B), we have from (1.14) 

(1.16) p(AB) = p(A)p(B) 

and it then follows from (1.14) and (1.16) that p(A | B) = p(A). The result 
expressed by equation (1.16) is 

Theorem 1.16. If A and B are independent, the probability of the compound 
event AB is the product of the respective probabilities. This is called the 
multiplication theorem for probabilities. 

The theorem may be extended to three or more independent events. Thus 

(1.17) p(ABC) = p(A)p(B)p(C) 

It must be noted, however, that independence means what is sometimes 
called “complete mutual independence.” Each event must be independent 
not only of each of the others but also of any combinations of the 
others. Even for only three events this requires four conditions, namely, 
p(A | B) = p(A), p(A | C) = p(A), p(B | C) = p(B), and p(C | AB) = p(C). 
There are five other conditions which may be deduced from these by means of 
(1.15) and the equivalent forms given by interchanging A, B, and C in all 
possible ways. 

A simple example given by Uspensky illustrates the fact that three events 
A, B, C may be “pairwise independent” without being independent in the 
foregoing sense. Four discs in a bowl are numbered respectively 112, 121, 211, 
222, and one disc is drawn at random. Let A, B, C denote respectively the 
events “the first digit is 1,” “the second digit is 1,” and “the third digit is 1,” 
in the number drawn. Then it is easily seen that p(A) = p(B) = p(C) = 1/2, 
and p(AB) = p(AC) = p(BC) = 1/4, so that p(AB) = p(A)p(B), etc. Hence 
A and B, A and C, and B and C are separately independent. However, 
p(ABC) = 0 and hence is not equal to p(A)p(B)p(C). In other words, 
A, B, C are not independent. We shall use the word “independent” in the 
future to mean this complete mutual independence, so that the multiplication 
theorem will always hold for independent events. 
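Uspensky's example is small enough to verify by listing the four discs. The sketch below does so in Python with exact fractions; the helper name `p` and the use of 0-based digit positions are conventions of this illustration only.

```python
from fractions import Fraction

discs = ["112", "121", "211", "222"]   # one disc is drawn at random

def p(*positions):
    """Probability that every listed digit position (0-based) shows a 1."""
    hits = [d for d in discs if all(d[i] == "1" for i in positions)]
    return Fraction(len(hits), len(discs))

pA, pB, pC = p(0), p(1), p(2)

# Pairwise independence: each pair satisfies the product rule (1.16).
pairwise = (p(0, 1) == pA * pB) and (p(0, 2) == pA * pC) and (p(1, 2) == pB * pC)

# Complete mutual independence would also require p(ABC) = p(A)p(B)p(C).
mutual = p(0, 1, 2) == pA * pB * pC
```

The flag `pairwise` is true while `mutual` is false, because no disc carries the number 111.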

1.7. The Axiomatic Treatment of Probability. The theorems of § 1.6 have 
been established only for situations in which the “classical definition” applies. 
They may be established, with greater difficulty, on the frequency definition 
of probability, using the idea of “collectives.” The axiomatic approach is 
somewhat as follows. 

If the event A is represented by a certain set S of values of x, we assume 
that there exists a function f(S), called the probability of A, such that 
$0 \le f(S) \le 1$ and such that $f(S_1 + S_2 + \cdots + S_n) = f(S_1) + f(S_2) + \cdots + f(S_n)$, 
where $S_1, S_2, \ldots, S_n$ are sets of values of x, no two of which have any 
point in common. Clearly, $S_1, S_2, \ldots, S_n$ correspond to mutually exclusive 
events, and the foregoing statement represents the addition theorem for 
probabilities. Extending this theorem to an infinite number of non-overlapping sets, 
we assume that f(S) is a completely additive function of S. That is, if we write 
f(S) = p(A), 

(1.18) $p(A_1 + A_2 + \cdots) = p(A_1) + p(A_2) + \cdots$ 

If x and y are random variables, we further assume that the combination 
(x, y), which corresponds to a compound event, is also a random variable and 
so has a probability distribution. This means that, if T is a set of values of 
y corresponding to the event B, so that p(B) is a function of T, say g(T), the 
probability of the compound event AB is a function of S and T, say h(S, T), 
which has the same general properties as f(S). That is, it is non-negative, 
completely additive, and is equal to 1 when S and T cover the whole range 
of possible values of their respective variables. 

We now define the conditional probabilities of B, given A, and of A, given 
B, by 

(1.19) p(B | A) = h(S, T)/f(S), p(A | B) = h(S, T)/g(T), 

which correspond to (1.14). It is understood, of course, that f(S) and g(T) 
are not zero. 

For a fixed set S, the ratio h(S, T)/f(S) is a completely additive function of 
T, and when T includes the whole range of y, h(S, T) = f(S). The ratio has, 
therefore, the properties of a probability, and the same statement holds for 
the second ratio in (1.19). 

Finally, we define the two events A and B as independent when 

(1.20) h(S, T) = f(S) · g(T) 

for any sets S and T. By (1.19), this implies 

p(B | A) = g(T) = p(B) 

and 

p(A | B) = f(S) = p(A) 

corresponding to our previous definition of independence. 

The theorems of § 1.6 have, therefore, a wider validity than appears from 
the proofs given, provided that we interpret “probability” in the above 
general sense. How far the assumptions made correspond to real-life 
situations is a matter to be considered in each individual problem. 

1.8 Some Examples of Probability. In this section we give a few typical 
calculations in elementary probability. Although they are concerned with 
rather trivial problems of cards, dice, and betting, they provide excellent 
illustrations of the fundamental theorems and in such problems the assump- 
tion of equally likely cases is a fairly reasonable one. 
Example 1. Find the probability that a hand of 13 cards contains 4 aces. 

The fundamental assumption here is that any specified hand is as likely as any other, dealt 
from a well-shuffled deck. The total number of ways of dealing a hand is, by Theorem 1.3, 
(52!)/(13! 39!). If the hand contains 4 aces, the remaining 9 cards may be chosen from 
the 48 cards other than aces. The number of ways is (48!)/(9! 39!). The probability 
required is the ratio of this number to the total number of ways of dealing the hand. Hence 

$p = \dfrac{48!\,13!}{9!\,52!} = \dfrac{10 \cdot 11 \cdot 12 \cdot 13}{49 \cdot 50 \cdot 51 \cdot 52} = \dfrac{11}{4165} \approx \dfrac{1}{379}$ 
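The count in Example 1 is easily confirmed with exact integer arithmetic. A minimal Python sketch, in which the variable names are ours:

```python
from fractions import Fraction
from math import comb

total_hands = comb(52, 13)      # all possible 13-card hands
four_ace_hands = comb(48, 9)    # the 4 aces are fixed; 9 more cards from the other 48

p_four_aces = Fraction(four_ace_hands, total_hands)
approx = float(p_four_aces)     # roughly 0.00264, that is, about 1 in 379
```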


Example 2. There are 7 horses in a race. A man has 5 dollars to bet with, in multiples 
of 1 dollar. In how many ways can he bet on one or more horses to win? 

By Theorem 1.8 the number is $\binom{11}{5} = 462$. We here think of the number of ways of 
putting 5 objects (dollars) into 7 distinguishable compartments, any number from 0 to 5 in one 
compartment. 
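The answer to Example 2 may be checked by brute force. The sketch below lists every way of splitting 5 whole dollars among 7 horses and compares the count with the binomial coefficient; it assumes that Theorem 1.8 is the usual count $\binom{n + r - 1}{r}$ for r like objects in n compartments, and the variable names are ours.

```python
from itertools import product
from math import comb

horses, dollars = 7, 5

# Every allocation is a tuple (d1, ..., d7) of whole dollars summing to 5.
allocations = [bet for bet in product(range(dollars + 1), repeat=horses)
               if sum(bet) == dollars]

count = len(allocations)                        # direct enumeration
formula = comb(horses + dollars - 1, dollars)   # assumed form of Theorem 1.8
```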

Example 3. Calculate the probability of obtaining a total of s points in a throw of n dice. 

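The total number of ways, all assumed equally likely, in which n dice may appear is $6^n$, 
assuming that the dice are distinguishable. The number of favorable ways (in which the 
total number of points is s) is equal to the number of ways in which n integers ranging from 1 
to 6 can add up to s ($n \le s \le 6n$). 

This number is the coefficient of $x^s$ in $f(x) = (x + x^2 + x^3 + x^4 + x^5 + x^6)^n$, since every 
favorable arrangement contributes one term to $x^s$ in the expansion of f(x). 

Now 

$f(x) = x^n (1 + x + \cdots + x^5)^n = x^n \left( \dfrac{1 - x^6}{1 - x} \right)^n$ 

so that the coefficient of $x^s$ in f(x) is the coefficient of $x^{s-n}$ in $(1 - x^6)^n (1 - x)^{-n}$, that is, in 

$\left[ \sum_{k} (-1)^k \binom{n}{k} x^{6k} \right] \left[ \sum_{l} \binom{n + l - 1}{l} x^{l} \right]$ 

Putting 6k + l = s - n, we obtain for this coefficient 

$\sum_{k=0}^{[(s-n)/6]} (-1)^k \binom{n}{k} \binom{s - 6k - 1}{n - 1}$ 

where $[(s-n)/6]$ denotes the integer equal to or next below (s - n)/6. 

The required probability is, therefore, 

$p = \dfrac{1}{6^n} \sum_{k=0}^{[(s-n)/6]} (-1)^k \binom{n}{k} \binom{s - 6k - 1}{n - 1}$ 

Thus for n = 3 and s = 12, (s - n)/6 = 1.5, so that k takes only the values 0 and 1. The 
probability is 

$p = \dfrac{1}{216} \left[ \binom{11}{2} - 3 \binom{5}{2} \right] = \dfrac{25}{216}$ 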
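The alternating sum of Example 3 can be compared with a direct enumeration of all $6^n$ throws for small n. In the Python sketch below the function names `ways_formula` and `ways_brute` are our own labels for the two counts.

```python
from fractions import Fraction
from itertools import product
from math import comb

def ways_formula(n, s):
    """Number of ways n dice total s, from the alternating sum of Example 3."""
    if not n <= s <= 6 * n:
        return 0
    return sum((-1) ** k * comb(n, k) * comb(s - 6 * k - 1, n - 1)
               for k in range((s - n) // 6 + 1))

def ways_brute(n, s):
    """The same count obtained by listing every throw of n distinguishable dice."""
    return sum(1 for roll in product(range(1, 7), repeat=n) if sum(roll) == s)

# The case worked in the text: n = 3 dice, total s = 12.
p_12_with_3 = Fraction(ways_formula(3, 12), 6 ** 3)
```

For n = 3 and s = 12 both counts give 25 favorable ways out of 216.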



Example 4. We will calculate the probability of winning in the game of “craps.” One 
player will win if (a) he throws 7 or 11 with two dice, (b) he throws 4, 5, 6, 8, 9, or 10 and 
if, on repeated throwing thereafter, the same number recurs before 7 shows up. We assume 
the dice to be perfect, so that every combination has an equal chance* of appearing. Of 
the 36 ways in which two dice may appear, six give a total of 7 and two give 11. Hence by 
the addition theorem the chance of 7 or 11 is 6/36 + 2/36 = 2/9. 

The chance of 4 is 3/36, and the chance that it will reappear before 7 is 3/(3 + 6) = 1/3, 
since the probabilities of 4 and 7 are proportional to 3 and 6 respectively. Hence by the 
multiplication theorem, the probability that 4 is thrown and that 4 reappears before 7 is 
3/36 · 1/3 = 1/36. This is the chance of winning on 4. It is also the chance of winning 
on 10, since the probabilities of throwing 4 and 10 are equal. The probability of 5 is 4/36, 
and the chance of winning on 5 is 4/36 · 4/(4 + 6) = 2/45. This is the same as for 9. The 
probability of 6 is 5/36, and the chance of winning on 6 is 5/36 · 5/(5 + 6) = 25/396, the 
same as for 8. Hence the total probability of winning is 

2/9 + 2/36 + 4/45 + 50/396 = 244/495 ≈ 0.493 
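The arithmetic of Example 4 can be reproduced exactly by counting the 36 outcomes of two dice. The following Python sketch follows the same steps as the text; the names `ways`, `p_point`, and `p_repeat` are ours.

```python
from fractions import Fraction
from itertools import product

# Number of ways each total can appear with two distinguishable dice.
ways = {}
for a, b in product(range(1, 7), repeat=2):
    ways[a + b] = ways.get(a + b, 0) + 1

p_win = Fraction(ways[7] + ways[11], 36)   # win at once on 7 or 11
for point in (4, 5, 6, 8, 9, 10):
    p_point = Fraction(ways[point], 36)
    # Chance that the point recurs before a 7: only these two totals matter.
    p_repeat = Fraction(ways[point], ways[point] + ways[7])
    p_win += p_point * p_repeat
```

The result is 244/495, slightly less than one-half, so the game is mildly unfavorable to the thrower.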




Example 5 (Chevalier de Méré’s Problem). Why does it pay to bet consistently on seeing 
6 at least once in 4 throws of a die, but not to bet on seeing double-6 at least once in 24 
throws with 2 dice? 

This is a famous problem in the history of our subject. Betting on dice was a common 
pastime at the French court in the middle of the seventeenth century, and the Chevalier was 
not only a confirmed gambler but also a very observant player and an amateur mathe- 
matician. It seemed to him obvious from common sense that the two events mentioned 
should have the same odds, but it did not seem to work out so in practice. At length he 
consulted Blaise Pascal (1623-1662), who calculated the probabilities. From a 
correspondence on this and similar matters between Fermat and Pascal, the whole subject of 
mathematical probability may be considered to take its rise. 

The probability of not seeing 6 in one throw is 5/6. Since successive throws are assumed 
independent, the probability of not seeing 6 in 4 throws is $(5/6)^4$. The probability of seeing 
6 is $1 - (5/6)^4 = 0.518$. 

Similarly the probability of seeing double-6 in 24 throws with 2 dice is 

$1 - (35/36)^{24} = 0.491$ 

The first probability is, therefore, greater than 1/2 and the second less. The difference is 
rather small, however, and the fact that it showed up in practice is a tribute to the 
perseverance of the Chevalier de Méré. 
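The two probabilities in the Chevalier's problem follow from the multiplication theorem applied to the complementary events, and are quickly evaluated. A short Python sketch, with variable names of our own choosing:

```python
# Probability of at least one 6 in 4 throws of a single die.
p_single = 1 - (5 / 6) ** 4

# Probability of at least one double-6 in 24 throws of two dice.
p_double = 1 - (35 / 36) ** 24

favorable_first = p_single > 0.5    # the first bet pays in the long run
favorable_second = p_double > 0.5   # the second does not
```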


1.9 Continuous Probabilities. In some types of problem it is not possible 
to enumerate the favorable cases and the total cases, because both these are 
infinite in number. In these problems there is a probability f(x) dx that a 
certain variable x will take values between x and x + dx; f{x) is then called a 
probability density. 

If x takes all values from $-\infty$ to $\infty$, and if the range from a to b is 
considered a success, the probability of success is defined as 

$\int_a^b f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx = \int_a^b f(x)\,dx$ 

since 

$\int_{-\infty}^{\infty} f(x)\,dx = 1$ 

* Here, and elsewhere, “chance” is used as a synonym for “probability.” 
Sometimes, of course, x can, in the nature of things, take only values in a 
finite range. In such cases f{x) is supposed identically zero for all values 
outside this range. 


Example 6 (Bertrand’s Problem). If a chord is drawn at random in a given circle, what 
is the probability that its length is at least equal to the radius? 

There is no unique answer to this problem, since it depends upon our interpretation of 
drawing a chord “at random.” We may, for example, choose one end A anywhere on the 
circumference and then choose the other end B similarly. The assumption here is that, 
A having been fixed, all angular positions of B with respect to A are equally likely. If θ is 
this angular position, the total range of θ is from 0 to 2π. The favorable range is from 
π/3 to 5π/3, so that, f(θ) being assumed constant, the probability is 2/3. 

But we might also choose any radius OA and select a point B on it at random, drawing 
our chord through B perpendicular to OA. If x is the distance OB and r the radius of the 
circle, the favorable range of x is from 0 to √3 r/2 and the total range from 0 to r. Hence, 
if f(x) is assumed constant, the probability is √3/2. 

Other answers are also possible. Each answer is correct on its own interpretation and 
each corresponds to some possible experimental set-up. Thus, if a circle were drawn on a 
table-top and a piece of straight wire longer than the diameter were tossed on the table at 
random, the chord being measured whenever the wire crossed the circle twice, the relative 
frequency of success would probably tend to the value √3/2. On the other hand, if a radial 
line were drawn on a transparent roulette wheel, and if immediately beneath the wheel 
there were placed a fixed circle of diameter equal to the length of the radial line and such 
that the axis of the roulette wheel passed through a point on its circumference, the line 
would sometimes come to rest crossing this circle. If the chord were measured each time 
this happened, the experimental relative frequency would probably tend to 2/3. 

Example 7 (Buffon’s Problem). Parallel straight lines, a distance a apart, are drawn on 
a table, and a thin needle of length l < a is tossed onto the table at random. What is the 
probability that the needle will cross a line? 

Let θ be the angle that the needle makes with the lines. All values of θ from 0 to π may 
be assumed to be equally likely. 

The needle will cross the nearest line if the distance y of its center from that line is less 
than (l/2) sin θ. If all distances from 0 to a/2 are equally likely, the required probability is 

$p = \dfrac{\int_0^{\pi} \int_0^{(l/2)\sin\theta} dy\,d\theta}{\int_0^{\pi} \int_0^{a/2} dy\,d\theta} = \dfrac{2}{\pi a} \int_0^{\pi} \dfrac{l}{2} \sin\theta\,d\theta = \dfrac{2l}{\pi a}$ 

This gives an experimental method of determining π. Lazzerini claimed to have found 
a value 3.1415929 from 3408 trials. The accuracy of this result, if genuine, is a remarkable 
coincidence. 
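Buffon's experiment is easy to imitate with pseudo-random numbers. The Python sketch below is an illustration only: the function name `buffon_estimate`, the needle length 1, the spacing 2, and the seed are arbitrary choices of ours. It should also be noted that the program itself uses the value of π to draw the angle, so it checks the formula p = 2l/(πa) rather than determining π independently.

```python
import math
import random

def buffon_estimate(trials, needle=1.0, spacing=2.0, seed=0):
    """Estimate pi from simulated needle tosses (requires needle < spacing)."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, math.pi)     # angle between needle and lines
        y = rng.uniform(0.0, spacing / 2)     # distance of center from nearest line
        if y < (needle / 2) * math.sin(theta):
            crossings += 1
    # p = 2l/(pi a) is estimated by crossings/trials, so pi is about 2l*trials/(a*crossings).
    return 2 * needle * trials / (crossings * spacing)

estimate = buffon_estimate(200_000)
```

With a few hundred thousand tosses the estimate is ordinarily correct only to about two decimal places, which shows how remarkable seven-figure agreement from 3408 trials would be.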


1.10 Bayes’ Theorem. Let there be n mutually exclusive and exhaustive 
events $A_1, A_2, \ldots, A_n$, that is, at least one of them, and not more than one, 
must have happened, but it is not known which one. Suppose also that an 
event B may follow any one of the events $A_k$, with known probabilities, and 
that B is known to have happened. What is then the probability that it was 
preceded by the particular event $A_k$? 

We suppose that the a priori probabilities $p(A_1), p(A_2), \ldots, p(A_n)$ are 
known, and also the probabilities $p(B \mid A_1), p(B \mid A_2), \ldots, p(B \mid A_n)$. We 
have to calculate $p(A_k \mid B)$, the a posteriori probability of $A_k$. 
Now the probability that $A_k$ happens and is followed by B is 
$p(A_k) \cdot p(B \mid A_k)$. The probability that B happens, no matter which of the 
$A_k$ preceded it, is 

$\sum_{j=1}^{n} p(A_j) \cdot p(B \mid A_j)$ 

Hence the probability that when B happens it is preceded by $A_k$ is given by 

(1.21) $p(A_k \mid B) = \dfrac{p(A_k) \cdot p(B \mid A_k)}{\sum_{j=1}^{n} p(A_j) \cdot p(B \mid A_j)}$ 

This is the famous rule stated by Bayes. 

The rule as it stands is undoubtedly correct but of very limited application, 
since the a priori probabilities are seldom known except in artificial examples. 
On the other hand, the assumption made by Bayes that, in the absence of any 
information to the contrary, the a priori probabilities may be taken as equal 
is a highly dubious one. The interested reader may refer to a paper by R. A. 
Fisher on “Uncertain Inference,” in which this assumption is criticized and 
an alternative approach suggested. A quotation follows. 

Thomas Bayes’ paper of 1763 was the first attempt known to us to rationalize 
the process of inductive reasoning. From time immemorial, of course, men had 
reasoned inductively; sometimes, no doubt, well, and sometimes badly, but the un- 
certainty of all such inferences from the particular to the general had seemed to cast 
a logical doubt on the whole process. By the middle of the eighteenth century, how- 
ever, experimental science had taken its first strides, and all the learned world was 
conscious of the effort to enlarge knowledge by experiment, or by carefully planned 
observation. To such an age the limitations of a purely deductive logic were intoler- 
able. Yet it seemed that mathematicians were willing to admit the cogency only of 
purely deductive reasoning. From an exact hypothesis, well defined in every detail, 
they were prepared to reason with precision as to its various particular consequences. 
But, faced with a finite, though representative, sample of observations, they could 
make no rigorous statements about the population from which the sample had been 
drawn. 

Bayes perceived the fundamental importance of this problem and framed an axiom, 
which, if its truth were granted, would suffice to bring this large class of inductive 
inferences within the domain of the theory of probability; so that, after a sample 
had been observed, statements about the population could be made, uncertain 
inferences, indeed, but having the well-defined type of uncertainty characteristic of 
statements of probability. Bayes’ technique in this feat is ingenious. His predeces- 
sors had supplied adequate methods, given a well-defined population, for stating the 
probability that any particular type of sample might result. His problem was: given 
a particular kind of sample, to state with what probability a particular type of popula- 
tion might have given rise to it. He imagines, in effect, that the possible types of 
population have themselves been drawn, as samples, from a super-population, and 
his axiom defines this super-population with exactitude. His problem thus becomes 
a purely deductive one to which familiar methods were applicable. 
Example 8. A bag contains 10 balls, either black or white, but it is not known how many 
of each. A ball is drawn at random and is white. What is the probability that the bag 
contained at least 5 white balls originally? 

To start with there are 11 possibilities (the number of white balls may be 0, 1, 2, ..., 10). 
Let the corresponding a priori probabilities be $p_0, p_1, \ldots, p_{10}$. 

The event B is the drawing of a white ball. If the bag contained k white balls, 

$p(B \mid A_k) = \dfrac{k}{10}$ 

Therefore the probability of k white balls initially is 

$p(A_k \mid B) = \dfrac{k\,p_k}{\sum_{j=0}^{10} j\,p_j}$ 

and the probability required is 

$p = \dfrac{\sum_{k=5}^{10} k\,p_k}{\sum_{k=0}^{10} k\,p_k}$ 

This cannot be evaluated unless the $p_k$ are known. 

(a) Assume that all values of k are equally likely, before the ball is drawn. Then $p_k = 1/11$ 
for all k. Hence 

$p = \dfrac{\sum_{k=5}^{10} k}{\sum_{k=0}^{10} k} = \dfrac{45}{55} = \dfrac{9}{11} = 0.82$ 

(b) Assume that the bag was filled originally by picking 10 balls at random from a mixture 
of a very large number of black and white balls in equal proportions. The probability of k 
white and 10 - k black is then proportional to $\binom{10}{k}$. Therefore 

$p = \dfrac{\sum_{k=5}^{10} k \binom{10}{k}}{\sum_{k=0}^{10} k \binom{10}{k}} = \dfrac{3820}{5120} = 0.75$ 

Note that on assumption (b) the a priori probability of at least 5 white balls is 

$\dfrac{1}{2^{10}} \sum_{k=5}^{10} \binom{10}{k} = \dfrac{638}{1024} = 0.62$ 

This is increased by the additional knowledge given by the drawing of a single white ball to 
0.75. 
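Both parts of Example 8 are direct applications of (1.21) and can be evaluated with exact fractions. In the Python sketch below the function name `posterior` and the list layout (index k for a bag with k white balls) are assumptions of this illustration.

```python
from fractions import Fraction
from math import comb

def posterior(prior, likelihood):
    """Bayes' rule (1.21): each prior times its likelihood, divided by their total."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

# p(B | A_k) = k/10: chance of drawing a white ball when k of the 10 are white.
likelihood = [Fraction(k, 10) for k in range(11)]

uniform_prior = [Fraction(1, 11)] * 11                                # assumption (a)
binomial_prior = [Fraction(comb(10, k), 2 ** 10) for k in range(11)]  # assumption (b)

p_a = sum(posterior(uniform_prior, likelihood)[5:])    # at least 5 white, case (a)
p_b = sum(posterior(binomial_prior, likelihood)[5:])   # at least 5 white, case (b)
prior_b = sum(binomial_prior[5:])                      # same event before the draw, case (b)
```

The two priors lead to different answers, 9/11 and 191/256, which is exactly the difficulty discussed in the text.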


Since the values of $p_k$ are unknown the problem does not have a unique 
solution. Moreover, if they were known we should be back in the domain of 
deductive probabilities again since all the probabilities in the right-hand 
member of (1.21) would then be known a priori. It is only when the $p(A_k)$ are 
unknown that we are properly in the domain of a posteriori probability. In 
practical problems the $p(A_k)$ are scarcely ever known. 

Bayes realized this and argued that the $A_k$'s may be considered equally 
probable unless we have some reason to think they are not. Under this 
“doctrine of insufficient reason,” the $A_k$'s are assumed to have equal existence 
probabilities. In this case, $p(A_k)$ = constant and would cancel in (1.21), thus 
permitting a definite solution in Example 8. It appears that Bayes had serious 
doubts about this “doctrine,” for he withheld his entire treatise from 
publication until his doubts should be resolved, and it was only after his death that 
his paper was published by friends. Laplace, however, was less cautious, and 
he incorporated the doubtful theorem into his Théorie Analytique des 
Probabilités. Robed in the authority of Laplace it went unquestioned for a long 
time. Boole was the first, in 1854, to criticize the assumption of “the equal 
distribution of our knowledge, or rather of our ignorance” and “the assigning 
to different states of things of which we know nothing, equal degrees of 
probability.” Today, it is well known that the assumption of constant existence 
probabilities may lead to mathematical contradictions. This may clearly be 
seen in the following example, given by R. A. Fisher. 

For a continuous variable, Bayes’ theorem (1.21) may be written 

(1.22) $p(x \mid B) = \dfrac{f(x)\,p(B \mid x)}{\int f(x)\,p(B \mid x)\,dx}$ 

where f(x) is the a priori probability density of x and p(B | x) is the probability 
of the event B when x has an assigned value. 

Let θ be the probability that a random individual from an infinite 
population has a particular characteristic, which we call “success.” The probability 
that in a sample of n the first r individuals selected are successes and the next 
n - r failures will be $\theta^r (1 - \theta)^{n-r}$. Since the r successes and n - r failures 
can be rearranged in $\binom{n}{r}$ ways, the probability of r successes and n - r failures, 
in any order, is 

$p(r \mid \theta) = \binom{n}{r} \theta^r (1 - \theta)^{n-r}$ 

Hence the probability density of θ, given r, is 

(1.23) $p(\theta \mid r) = \dfrac{f(\theta)\,p(r \mid \theta)}{\int_0^1 f(\theta)\,p(r \mid \theta)\,d\theta}$ 

If we assume that all values of θ from 0 to 1 are equally probable, a priori, 
we can put f(θ) = 1, and then 

(1.24) $p(\theta \mid r) = C\,\theta^r (1 - \theta)^{n-r}$ 

where C depends on the integral in (1.23). 
If $\hat{\theta}$ is an estimate of θ so chosen as to make this probability a maximum, 
we easily find by equating to zero the derivative of $\theta^r (1 - \theta)^{n-r}$ with respect 
to θ that $\hat{\theta} = r/n$. 

But by the very nature of the doctrine of insufficient reason we have no 
more reason to assume all values of θ to be equally likely than we have to 
assume the same thing for some function of θ, say $\phi = \sin^{-1}(2\theta - 1)$. Since 
$d\phi/d\theta = [\theta(1 - \theta)]^{-1/2}$, a constant probability density for φ means a probability 
density for θ proportional to $\theta^{-1/2}(1 - \theta)^{-1/2}$. In fact f(θ) will be 
$\pi^{-1}\theta^{-1/2}(1 - \theta)^{-1/2}$, in contradiction to the previous assumption. 

Bayes’ rule then gives 

(1.25) $p(\theta \mid r) = C'\,\theta^{r - 1/2}(1 - \theta)^{n - r - 1/2}$ 

where C' is again a constant, and the estimate of θ found by maximizing this is 
$\hat{\theta} = (2r - 1)/(2n - 2)$. 
Clearly, any number of different estimates could be obtained by choosing 
different functions φ, and logically one is as good as another, on Bayes’ 
argument. Unless, therefore, there is some cogent reason, depending on the 
physical circumstances of a problem, for regarding one variable rather than 
another as having a constant probability density, Bayes’ rule, as commonly 
applied, is an unsafe basis for statistical inference. As we shall see later, 
statements can be made about population parameters (such as θ), subject to 
assigned risks of being wrong, without the necessity of making any 
assumptions regarding a priori probabilities. 
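The disagreement between the two estimates can be seen numerically. The Python sketch below maximizes the kernels of (1.24) and (1.25) over a fine grid of values of θ; the sample (3 successes in 10 trials), the grid size, and the helper name `argmax_on_grid` are hypothetical choices made for this illustration.

```python
def argmax_on_grid(density, steps=100_000):
    """Return the theta in (0, 1) that maximizes `density` on a fine grid."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=density)

n, r = 10, 3   # hypothetical sample: 3 successes in 10 trials

# Kernel of (1.24): constant a priori density for theta.
uniform_mode = argmax_on_grid(lambda t: t ** r * (1 - t) ** (n - r))

# Kernel of (1.25): constant a priori density for phi = arcsin(2*theta - 1).
arcsine_mode = argmax_on_grid(lambda t: t ** (r - 0.5) * (1 - t) ** (n - r - 0.5))
```

The first maximum falls at r/n = 0.3 and the second near (2r - 1)/(2n - 2) = 5/18, so the same data give two different estimates according to which variable is supposed uniformly distributed.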

1.11 Bayes’ Theorem for Future Events. Bayes’ theorem may be extended 
to the probability of future events, as follows. Let C be an event which occurs 
after B and which itself follows one and only one of the events $A_1, A_2, \ldots, A_n$. 
It is required to calculate the probability of C when it is known that B has 
happened. 

The probability that B happens is 

$\sum_{k=1}^{n} p(A_k)\,p(B \mid A_k)$ 

The probability that B happens and is followed by C is 

$\sum_{k=1}^{n} p(A_k)\,p(B \mid A_k)\,p(C \mid A_k, B)$ 

Hence the probability that C happens if B happens is 

(1.26) $p(C \mid B) = \dfrac{\sum_{k} p(A_k)\,p(B \mid A_k)\,p(C \mid A_k, B)}{\sum_{k} p(A_k)\,p(B \mid A_k)}$ 

Example 9. With the conditions as stated in Example 8, what is the probability that if 
a second ball is drawn at random (without the first ball being returned to the bag) it also will 
be white? 
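Here events B and C both consist of the drawing of a white ball. 

$p(B \mid A_k) = \dfrac{k}{10}, \qquad p(C \mid A_k, B) = \dfrac{k - 1}{9}$ 

$p(C \mid B) = \dfrac{\sum_{k=1}^{10} \frac{k(k-1)}{90}\,p_k}{\sum_{k=1}^{10} \frac{k}{10}\,p_k} = \dfrac{1}{9} \cdot \dfrac{\sum_{k=1}^{10} k(k-1)\,p_k}{\sum_{k=1}^{10} k\,p_k}$ 

If all compositions are assumed equally likely a priori, $p_k = 1/11$, and 

$p(C \mid B) = \dfrac{1}{9} \cdot \dfrac{\sum_{k=1}^{10} k^2 - \sum_{k=1}^{10} k}{\sum_{k=1}^{10} k} = \dfrac{1}{9} \cdot \dfrac{385 - 55}{55} = \dfrac{2}{3}$ 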
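Formula (1.26) for Example 9 may be evaluated for any assumed set of a priori probabilities. A Python sketch with exact fractions follows; the function name `second_white_given_first` and the list layout (index k for a bag with k white balls) are our own.

```python
from fractions import Fraction

def second_white_given_first(prior):
    """Formula (1.26) for Example 9; prior[k] is the a priori probability of k white balls."""
    # p(B | A_k) = k/10 and p(C | A_k, B) = (k - 1)/9; the k = 0 term vanishes.
    numerator = sum(p * Fraction(k, 10) * Fraction(k - 1, 9)
                    for k, p in enumerate(prior))
    denominator = sum(p * Fraction(k, 10) for k, p in enumerate(prior))
    return numerator / denominator

# All 11 compositions equally likely a priori.
p_uniform = second_white_given_first([Fraction(1, 11)] * 11)
```

With equal a priori probabilities the result is 2/3.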


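In Example 9 we have used the algebraic formulas: 

(1.27) $\sum_{k=1}^{N} k = \tfrac{1}{2} N(N + 1)$ 

(1.28) $\sum_{k=1}^{N} k^2 = \tfrac{1}{6} N(N + 1)(2N + 1)$ 

For reference we may add the further results: 

(1.29) $\sum_{k=1}^{N} k^3 = \tfrac{1}{4} N^2 (N + 1)^2$ 

(1.30) $\sum_{k=1}^{N} k^4 = \tfrac{1}{30} N(N + 1)(2N + 1)(3N^2 + 3N - 1)$ 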

Problems 

1. Prove both algebraically and verbally that 
(a) P(n, r) = C(n, r)P(r, r), (b) $\binom{n}{r} = \binom{n}{n - r}$ 

2. From among nine men A, B, C, D, E, F, G, H, I, a committee of four men will be 
chosen. The nine names will be written on nine separate cards and four cards drawn at 
random one at a time from a box. 

(a) In how many different ways may the four cards come out? Ans. 3024. 

(b) How many different committees are possible not including the man A? Ans. 70. 

3. Consider the word “introduce.” (a) In how many of the possible arrangements of 
all its letters will there be a consonant in the first place? Ans. 201,600. (b) From its letters 
how many four letter permutations consisting of three vowels and one consonant can be 
formed? Ans. 480. (c) If five of its letters are selected at random what is the probability 
that two are vowels and three are consonants? Ans. 10/21. 

4. On a table there are four different biographies with brown backs and seven different 
novels with red backs. (a) If all of the books are placed upright in a row on a shelf, in how 
many different ways may they be arranged so that the orders of the colors are different? 
Ans. 330. (b) In how many different ways may two of the biographies and three of the 
novels be selected and arranged on the shelf so that the orders of the books are different? 
Ans. 25,200. 
5. In a box there are five red billiard balls with the numbers 1, 2, 3, 4, 5, painted on 
them (one on each ball), and three white billiard balls with the numbers 1, 2, 3, similarly 
painted on them. From the box a man draws two balls at random. (a) What is the 
probability that one of the balls drawn is white and the other is red? Ans. 15/28. (b) What 
is the probability that the two balls drawn have either the same color or the same number? 
Ans. 4/7. 

6. A bag contains four white, five red, and six black balls. Three are drawn at random. 
Find the probability that (a) no ball drawn is black, (b) exactly two are black, (c) all are of 
the same color. 

7. An urn contains four white and five black balls. Three balls are drawn at random 
and replaced by green balls. If then two balls are drawn at random, what is the probability 
that they are all of the same color? Ans. 29/108. 

8. Write out the expressions for 

9. Write in expanded form: 



10. Twelve cards have been dealt, six down, and the other six showing a jack, two kings, 
a seven, a five, and a four. What is the probability that the next card will be a four or less? 
(National Mathematics Magazine, 13, p. 94.) 

11. From an urn containing ten balls, numbered from one to ten, balls are drawn, one by 
one and placed in a row of holes, numbered from one to ten, each ball being placed in the 
proper hole. What is the probability that there will not be an empty hole between two filled 
ones at any time of the drawing? (American Mathematical Monthly, 46, p. 635.) Ans. 
2/14,175. 

12. Use equation (1.4) and the identity $(x + y)^m (x + y)^n = (x + y)^{m+n}$ to prove that 

$\sum_{i} \binom{m}{i} \binom{n}{r - i} = \binom{m + n}{r}$ 

where r = 0, 1, 2, ..., m + n. 

13. From the result of Problem 12 prove that 

14. Prove that the probability that some one of the hands of cards in a particular bridge 
deal contains all 13 cards of a suit is about 1 in 40 thousand millions. 

The fact that more cases are reported of hands of this character than seems reasonable 
from the extremely small probability, appears to be due to the imperfect shuffling in actual 
play. 

15. Work out Example 9, § 1.11, on the assumption (b) of Example 8. 

16. A bag contains 10 balls, black or white, but it is not known how many of each. A 
ball is drawn at random, looked at and replaced, three times, and each time it is white. 
What is the probability that a fourth drawing will also give a white ball? 

17. A bag contains six balls, identical except for color, and known to be either white or 
black. A ball is drawn, looked at, and replaced, and the balls are then shaken up. This is 
done four times, and three times the ball is white. What can be said about the contents of 
the bag, and about the chance that if a fifth ball is drawn similarly it will be white? 
18. A card is drawn at random from a deck and replaced, then a second drawing is made, 
and so on. If the cards are well shuffled each time, how many drawings must be made in 
order that there may be at least an even chance of seeing the ace of spades? Ans. 36 or more. 

19. Five cards are drawn at random from a deck 1000 times. How many times would 
you expect to get: (a) 5 of one suit; (b) 4 of one suit; (c) 3 of one suit, 2 of another; (d) 3 of 
one suit, 1 each of two others; (e) 2 of one suit, 2 of another, 1 of a third; (f) 2 of one suit, 
3 of different suits. Note that the expected number for any combination is 1000 times the 
probability of that combination. Ans. (a) 2, (b) 43, (c) 103, (d) 223, (e) 365, (f) 264. 

20. There are 30 students in a class. Evaluate the probability that at least two have the 
same birthday. (Assume that the year contains 365 days, all of them equally likely as 
birthdays.) Ans. 0.706. 

21. Assume that (a) a Shakespeare sonnet contains about 600 letters (including punctua- 
tion marks and spaces); (b) a typewriter has 42 keys; (c) a monkey can strum on the type- 
writer at a speed of 300 letters a minute; show that the time that may be expected to elapse 
before one of the six monkeys mentioned in the quotation on page 1 produces any one of the 
approximately 150 Shakespearian sonnets is of the order 10®®® years. (The estimated total 
life of the solar system, according to G. Gamow, is of the order 10^® years ) 

Hint. The expected time in years is the reciprocal of the probability that the event occurs 
in any one year. 
CHAPTER II 


THE BINOMIAL DISTRIBUTION AND THE NORMAL AND POISSON APPROXIMATIONS 

2.1 The Binomial or Bernoulli Distribution. Suppose that a sample of s 
individuals from a given population is divided into two groups according as 
the members have or have not a certain attribute. Such a division is called 
a dichotomy. For example, the division may be into “heads” and “tails” 
for a population of coin tosses, or “male” and “female” for a population of 
children, where for this purpose “tail” is regarded as synonymous with 
“not-head” and “female” with “not-male.” The occurrence of the attribute 
is frequently, and conventionally, called a “success.” If x individuals have 
the attribute in question, x is an integer which may take any value from 0 to s 
inclusive, and x/s is called the relative frequency of success. The process of 
selecting an individual is often called a “trial.” 

The proportion p of individuals having the attribute in the parent population 
may be taken as the probability that any one individual selected at random 
has this attribute and, if the parent population is very large compared 
with s, p will not change appreciably in the process of selecting the whole 
sample. Then the probability that the first x members have the attribute 
and the remaining s − x do not is \(p^x(1 - p)^{s-x}\). Since the number of ways 
of rearranging the sample without changing its composition is \(\binom{s}{x}\), the 
probability of x successes in s trials is given by 

(2.1) \[ p(x, s) = \binom{s}{x} p^x q^{s-x} \] 

where q is written for 1 − p. This is the same as the term containing \(p^x\) in 
the expansion of the binomial \((q + p)^s\). 

The probabilities that the attribute occurs 0, 1, 2, ..., s times in a sample 
of s are, therefore, given by the successive terms in \((q + p)^s\). This result was 
obtained by J. Bernoulli and was published posthumously in 1713 in a work 
called Ars Conjectandi. A discrete frequency distribution with frequencies 
proportional to these probabilities is, therefore, called a Bernoulli or binomial 
distribution. If we take N sets of s trials each, the theoretical absolute 
frequencies are \(N p(x, s)\) and hence may be integers if p is a rational fraction and 
N is suitably chosen. 

In evaluating p(x, s), it is convenient to use tables of logarithms of factorials. 
From (2.1) we have 

\[ \log p(x, s) = \log s! - \log x! - \log (s - x)! + x \log p + (s - x) \log q \] 

Seven-figure tables of log n! for n = 1 to n = 1000 are given in Glover’s 
Tables of Applied Mathematics, and in Pearson’s Tables for Statisticians and 
Biometricians, Part I. 

An alternative systematic procedure for calculating p(x, s) is suggested in 
Example 1, § 2.10. 

Extensive tables of the binomial distribution are now available, giving 
individual terms and the accumulated sums of terms. (See the footnote to 
§ 6.15.) 
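
The logarithmic evaluation of (2.1) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and `math.lgamma` is assumed to stand in for a printed table of log n!.

```python
import math

def log10_factorial(n):
    """Common logarithm of n!, playing the role of a table of log n!."""
    return math.lgamma(n + 1) / math.log(10)

def binomial_term(x, s, p):
    """p(x, s) of (2.1), evaluated through logarithms of factorials.

    Assumes 0 < p < 1 so that both logarithms exist.
    """
    q = 1.0 - p
    log_term = (log10_factorial(s) - log10_factorial(x) - log10_factorial(s - x)
                + x * math.log10(p) + (s - x) * math.log10(q))
    return 10.0 ** log_term

if __name__ == "__main__":
    s, p = 50, 0.2
    terms = [binomial_term(x, s, p) for x in range(s + 1)]
    print(sum(terms))  # the terms of (q + p)^s should add to very nearly 1
    # Compare one term with a direct evaluation of (2.1).
    print(binomial_term(10, s, p), math.comb(s, 10) * p**10 * (1 - p)**40)
```

Working with logarithms keeps every intermediate quantity of moderate size, which is the same reason the tables of log n! were convenient for hand computation.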

2.2 Graphical Representation. A binomial distribution may be represented 
graphically by a histogram. This is accomplished by constructing rectangles 
centered at x = 0, 1, 2, ..., s with heights proportional to the terms of the 
binomial. Since the values of x constitute a discrete series it might seem more 
logical to represent the relative frequencies 
by ordinates instead of rectangles. How- 
ever, since the base of each rectangle is unity 
the number representing its height is also its 
area, and the representation by areas will be 
useful in our work. 

If we are thinking of relative frequencies 
or probabilities the sum of the areas of all 
the rectangles is unity, whereas if we are 
thinking of absolute frequencies the total 
area of the histogram is N. Thus if six 
coins are tossed 64 times the theoretical ab- 
solute frequencies are given by the terms of 
\(64(\tfrac12 + \tfrac12)^6\). These are 1, 6, 15, 20, 15, 6, 1 and their sum is 64. 
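
As a check on the six-coin figures, the theoretical absolute frequencies N p(x, s) may be computed exactly. The sketch below is illustrative only; the function name is an assumption, and rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction
from math import comb

def theoretical_frequencies(N, s, p):
    """Terms of N(q + p)^s for x = 0, 1, ..., s, as exact fractions."""
    q = 1 - p
    return [N * comb(s, x) * p**x * q**(s - x) for x in range(s + 1)]

freqs = theoretical_frequencies(64, 6, Fraction(1, 2))
print([int(f) for f in freqs], sum(freqs))
```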

It is often convenient to think of a frequency distribution as a distribution 
of mass. Here the masses (proportional to the frequencies) are concentrated 
at the points x = 0, 1, ..., s along the x-axis. 

2.3 Frequency Functions. Stieltjes Integrals. The notion of frequency 
functions relates to theoretical universes. The concept is an idealization of 
observed distributions comparable to the idealization of the outlines of material 
objects into the straight lines and circles of geometry. 

[Fig. 1. Histogram of \((\tfrac12 + \tfrac12)^6\)] 

A continuous variable x is said to have the frequency function f(x), which 
we take to be single-valued and non-negative, if the frequency of occurrence 
of x in the range a < x < b is measured by 

(2.2) \[ \int_a^b f(x)\,dx \] 

If x has the frequency function f(x) with total frequency N, then 

(2.3) \[ \int_{-\infty}^{\infty} f(x)\,dx = N \] 

and y = f(x) is called a theoretical frequency curve or, more briefly, a frequency 
curve. If the actual occurrence of the variable is limited to a finite range, 
f(x) is defined to be identically zero outside that range. If the total area 
under the curve is taken as unity, so that 

(2.4) \[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \] 

then y = f(x) is variously called the probability density, the probability 
distribution, or the probability function of x. Then, f(x) dx gives, to within 
infinitesimals of order higher than that of dx, the probability that x lies in the 
interval (x, x + dx). Under condition (2.4), the integral (2.2) denotes the 
probability that x lies in the interval (a, b). Under condition (2.3), (2.2) 
denotes frequency of values in the interval (a, b). A suitable function can 
be regarded, therefore, either as a frequency function or as a probability 
function according as condition (2.3) or (2.4) is imposed. The distinction can be 
adjusted by determining appropriately a constant factor in y = f(x). If the 
variable x has only discrete values and if f(x) is the probability of occurrence 
of the value x, then 

(2.5) \[ \sum_x f(x) = 1 \] 

the sum being over all the admissible values of x. 

The continuous and discrete cases may conveniently be included together 
by writing the left-hand side of (2.4) or (2.5) as a Stieltjes integral 

(2.6) \[ \int_{-\infty}^{\infty} dF(x) = 1 \] 

where this integral is interpreted as \(\int f(x)\,dx\) when x is continuous and as 
\(\sum_x f(x)\) when x is discrete. It also covers cases where f(x) is partly continuous 
and partly discrete, although we shall not encounter examples of this nature. 
The Stieltjes integral is not merely a convenient short-hand device. It plays 
an important role in the more advanced treatment of mathematical statistics. 
The reader may refer to an expository article by J. Shohat for a further 
discussion of this integral. 

2.4 Distribution Functions. The function F(x) that appears in (2.6) is 
called the distribution function, or, when multiplied by N, the cumulative 
frequency function. It may be written 

(2.7) \[ F(x) = \int_{-\infty}^{x} dF(x) \] 

and is interpreted as \(\int_{-\infty}^{x} f(x)\,dx\) if the distribution possesses a frequency 
function or as \(\sum f(x)\), summed over the admissible values not exceeding x, if the 
distribution is discrete. Clearly F(x) is a non-negative, non-decreasing function 
which is zero at x = −∞ and 1 at x = +∞. For a discrete distribution it increases 
in steps at the values of x for which the frequency function is not zero, and is 
continuous on the right. Thus for the binomial distribution of Figure 1, the 
distribution function is of the form shown in Figure 2. For a continuous 
distribution, F(x) is usually more or less of the form indicated in Figure 3. 
A variable x for which a distribution function exists is often called a variate. 

[Fig. 2] 

2.5 Mathematical Expectation. If there is a probability \(f(x_i)\) that a 
variable x will have the value \(x_i\), i = 1, 2, ..., n, and if the set of values \(x_i\) 
includes all possible values of x, the expected value or expectation of x is 

(2.8) \[ E(x) = \sum_{i=1}^{n} x_i f(x_i) \] 

where, of course, 

\[ \sum_{i=1}^{n} f(x_i) = 1 \] 

Thus, if I buy a ticket in a lottery in which there is a prize of $1000 and ten 
prizes of $50, and if 10,000 tickets are sold, my expectation, or the expected 
value of my ticket, is 

\[ \frac{1}{10{,}000} \times 1000 + \frac{10}{10{,}000} \times 50 + \frac{9989}{10{,}000} \times 0 = \$0.15 \] 

or 15 cents. If in a gambling game the expectation is equal to the stake, the 
game is said to be “fair.” The fair price for the lottery ticket mentioned 
would be 15 cents. Actual gambling games and lotteries are usually not fair, 
in this sense, since a substantial percentage is taken by the organizers or the 
“bank.” 
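
The lottery computation is a direct application of (2.8), and may be verified with a short sketch such as the following. The helper name `expectation` is hypothetical; exact fractions are used so the fair price appears without rounding.

```python
from fractions import Fraction

def expectation(outcomes):
    """Sum of value * probability over (value, probability) pairs, as in (2.8)."""
    assert sum(prob for _, prob in outcomes) == 1  # probabilities must total 1
    return sum(value * prob for value, prob in outcomes)

tickets = 10_000
lottery = [
    (1000, Fraction(1, tickets)),          # the single prize of $1000
    (50, Fraction(10, tickets)),           # the ten prizes of $50
    (0, Fraction(tickets - 11, tickets)),  # all remaining tickets win nothing
]
fair_price = expectation(lottery)
print(float(fair_price))
```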

If the variable x is continuous and if f(x) dx is the probability of a value 
between x and x + dx, the expected value of x is 

(2.9) \[ E(x) = \int_{-\infty}^{\infty} x f(x)\,dx \] 

where 

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \] 

provided the integral on the right of (2.9) exists. (An example where it does 
not exist is given in § 2.7.) 

If the variable x has the distribution function F(x), whether continuous or 
not, its expectation is given by 

(2.10) \[ E(x) = \int_{-\infty}^{\infty} x\,dF(x) \] 

which is equivalent to (2.8) and (2.9). 

If in an actual sample of N individuals there are \(N_i\) with the value \(x_i\) 
(i = 1, 2, ..., n), the arithmetic mean of the sample is 

(2.11) \[ \bar{x} = \frac{1}{N} \sum_{i=1}^{n} N_i x_i \] 

As N increases, the ratio \(N_i/N\) approximates more and more closely in the 
stochastic * sense to the probability \(f(x_i)\) that a randomly chosen individual 
of the parent population will have the value \(x_i\). Hence as N increases, \(\bar{x}\) tends 
to the value E(x). In other words, the expected value may be regarded as the 
mean of the hypothetical parent population characterized by the theoretical 
frequencies \(f(x_i)\). 

2.6 Moments. The expected value of a function g(x) is defined as 

(2.12) \[ E[g(x)] = \int_{-\infty}^{\infty} g(x)\,dF(x) \] 

provided the integral exists. 

If \(g(x) = x^r\), where r is a positive integer, (2.12) defines the rth moment about 
the origin, \(\nu_r\). Clearly \(\nu_1 = E(x)\), the population mean. If \(g(x) = (x - \nu_1)^r\), 
the same equation defines the rth moment about the mean, \(\mu_r\). That is, 

(2.13) \[ \nu_r = \int_{-\infty}^{\infty} x^r\,dF(x) \] 

(2.14) \[ \mu_r = \int_{-\infty}^{\infty} (x - \nu_1)^r\,dF(x) \] 

* That is, in the sense of the theory of probability rather than in the sense of the 
mathematical theory of limits (see § 1.3). 


Obviously, \(\mu_0 = 1\) and \(\mu_1 = 0\), whatever the distribution, provided \(\nu_1\) exists. 
In particular, for r = 2, the variance is defined by 

\[ \mu_2 = \sigma^2 = \int_{-\infty}^{\infty} (x - \nu_1)^2\,dF(x) \] 

where σ is called the standard deviation. If the variable is changed from x 
to τ, where 

(2.15) \[ \tau = (x - \nu_1)/\sigma \] 

the distribution, expressed as a function of τ, is in standard form, and τ is 
called a standardized variable. The variable τ is a pure number, since x and σ 
are in the same units. A distribution in standard form has mean 0 and 
standard deviation 1. 

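By expanding \((x - \nu_1)^r\) in powers of x, it is readily proved from (2.13) and 
(2.14) that 

\[ \mu_2 = \nu_2 - \nu_1^2 \] 

\[ \mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 \] 

\[ \mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4, \text{ etc.} \] 

and, in general, 

(2.16) \[ \mu_r = \nu_r - \binom{r}{1}\nu_{r-1}\nu_1 + \binom{r}{2}\nu_{r-2}\nu_1^2 - \cdots + (-1)^k \binom{r}{k}\nu_{r-k}\nu_1^k + \cdots + (-1)^{r-1}\left\{\binom{r}{r-1} - 1\right\}\nu_1^r \] 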

The moment-measure of skewness of a theoretical distribution is defined as 

(2.17) \[ \alpha_3 = \frac{\mu_3}{\sigma^3} \] 

It is a pure number, 0 for a symmetrical distribution, positive for a distribution 
with a long tail to the right (because of the high contribution to \((x - \nu_1)^3\) 
for large positive values of x) and negative for a distribution with a long tail 
to the left. It is sometimes denoted by \(\gamma_1\) or \(\pm\sqrt{\beta_1}\), the former being Fisher’s 
notation and the latter Karl Pearson’s. We shall generally use Fisher’s 
notation. 

The moment-measure of kurtosis for a theoretical distribution is defined as 

(2.18) \[ \alpha_4 = \frac{\mu_4}{\sigma^4} \] 

This is also a pure number. In Pearson’s notation it is called \(\beta_2\). It used 
to be regarded as a measure of “peakedness” but is now recognized to be 
greatly influenced by the behavior of the “tails” of the distribution. Since 
for the normal (Gaussian) distribution, the value of \(\alpha_4\) is 3, the excess of \(\alpha_4\) 
over 3 is often taken as a natural measure of kurtosis. This quantity, 
\(\gamma_2 = \alpha_4 - 3\), is sometimes called the excess. 

On the hypothesis that a parent population has a certain theoretical 
distribution, the moments of a sample from that population will be approximations 
to the moments of the theoretical distribution. We shall see later, when 
we come to discuss sampling theory, how the moments, skewness, or other 
parameters of the parent population may, in some cases at least, be estimated 
from the corresponding values for the sample. 

2.7 The Cauchy Distribution. The Cauchy distribution is a classical example 
of a probability distribution although its use in present-day statistics 
is relatively unimportant. Its equation is 

(2.19) \[ y = \frac{b}{\pi(b^2 + x^2)}, \qquad -\infty < x < \infty, \quad b > 0 \] 

The curve is symmetrical, having its center at x = 0. 

[Fig. 4. The Cauchy Curve] 


A simple derivation of this function is as follows. For a given real constant 
b locate the point (0, b) as in the figure below. Let lines be drawn at random 
through (0, b) and let θ be the variable angle between any such line and the 
negative direction of the y-axis; θ varies between the limits −π/2 and 
π/2. The hypothesis is that all values of θ in this range are equally 
likely. Denote the intercepts on the horizontal axis by x. Clearly, 
−∞ < x < ∞. The relation between θ and x is 

\[ \theta = \tan^{-1}\frac{x}{b} \] 

Under the hypothesis, the probability that an angle θ will be contained 
between θ and θ + dθ is dθ/π. By differentiation we find the relation 
between dθ and dx to be 

(2.20) \[ \frac{d\theta}{\pi} = \frac{b\,dx}{\pi(b^2 + x^2)} \] 

Therefore, the points of intersection of the lines with the x-axis are distributed 
so that the probability that a value of x will fall in the range dx is given by the 
right-hand member of (2.20). Hence the probability function for the variable 
x is 

\[ y = \frac{b}{\pi(b^2 + x^2)} \] 

and the probability that x lies in a finite interval (c, d) is given by 

\[ P(c, d) = \int_c^d \frac{b\,dx}{\pi(b^2 + x^2)} \] 

since the integral of the right-hand member of (2.20) from −∞ to ∞ is equal to 
unity, as can easily be verified. However, we cannot speak here of the mean 
value of x or of moments of higher order, since the integral 

\[ \int_{-\infty}^{\infty} \frac{x^k\,dx}{b^2 + x^2} \] 

has no meaning for k = 1, 2, .... 

If a machine gun were mounted on a swivel at (0, b) so as to be perfectly 
free to turn, and if it fired bullets into a very long straight wall stretching along 
the x-axis, the distribution of bullet holes would be a Cauchy distribution, 
on the assumption that all angular positions of the gun were equally likely. 
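
Two properties of the Cauchy curve can be illustrated numerically. The sketch below is an assumption-laden illustration (function names and the value b = 2 are arbitrary): it compares the closed form of the integral of the density over (c, d), obtained from θ = tan⁻¹(x/b), with a crude midpoint-rule integration, and it shows that the one-sided integral of x f(x) over (0, L) grows without bound as L increases, which is why the mean does not exist.

```python
import math

def cauchy_density(x, b):
    """Ordinate of the Cauchy curve with parameter b > 0."""
    return b / (math.pi * (b * b + x * x))

def cauchy_interval_probability(c, d, b):
    """Closed form of the integral of the density over (c, d)."""
    return (math.atan(d / b) - math.atan(c / b)) / math.pi

def midpoint_integral(f, lo, hi, n=100_000):
    """Midpoint-rule approximation to the integral of f over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def one_sided_first_moment(L, b):
    """Integral of x * density over (0, L), in closed form."""
    return b / (2 * math.pi) * math.log(1 + (L / b) ** 2)

b = 2.0
numeric = midpoint_integral(lambda x: cauchy_density(x, b), -1.0, 3.0)
closed = cauchy_interval_probability(-1.0, 3.0, b)
print(numeric, closed)

for L in (1e2, 1e4, 1e6):
    print(L, one_sided_first_moment(L, b))  # keeps growing like log L
```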

2.8 Moments of the Bernoulli Distribution. By definition, 

(2.21) \[ \nu_1 = E(x) = \sum_i x_i f(x_i) \] 

where 

\[ x_i = 0, 1, 2, \ldots, s, \qquad f(x_i) = \binom{s}{x_i} p^{x_i} q^{s - x_i} \] 

Then 

\[ \nu_1 = \sum_{k=0}^{s} k \frac{s!}{k!(s-k)!} p^k q^{s-k} = sp \sum_{k=1}^{s} \frac{(s-1)!}{(k-1)!(s-k)!} p^{k-1} q^{s-k} = sp(q + p)^{s-1} \] 

Since q + p = 1, 

(2.22) \[ \nu_1 = E(x) = sp \] 

Similarly, 

\[ \nu_2 = E(x^2) = \sum_{k=0}^{s} \{k(k-1) + k\} \frac{s!}{k!(s-k)!} p^k q^{s-k} \] 

\[ = s(s-1)p^2 \sum_{k=2}^{s} \frac{(s-2)!}{(k-2)!(s-k)!} p^{k-2} q^{s-k} + sp \] 

\[ = s(s-1)p^2(q + p)^{s-2} + sp \] 

\[ = s(s-1)p^2 + sp \] 

\[ = s^2p^2 + spq \] 

Hence 

(2.23) \[ \mu_2 = E(x - sp)^2 = \nu_2 - \nu_1^2 = spq \] 

Higher moments may be found similarly. They may also be calculated from 
the recursion formula, due to Romanovsky, 

(2.24) \[ \mu_{r+1} = pq\left[ s r \mu_{r-1} - \frac{d\mu_r}{dq} \right] \] 

A simple proof of (2.24) has been given by A. T. Craig. Note that \(\mu_r\) is here 
expressed as a function of s and q by means of the relation p = 1 − q. 

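Thus, knowing that \(\mu_0 = 1\), \(\mu_1 = 0\), we have: 

r = 1, \(\mu_2 = pq[s\mu_0 - 0] = spq\) 

r = 2, \(\mu_3 = pq[0 - (s - 2sq)] = spq(2q - 1) = spq(q - p) = sq(1 - q)(2q - 1)\) 

r = 3, \(\mu_4 = pq(3s^2pq + 6sq^2 - 6sq + s) = spq(3spq - 6pq + 1) = spq[1 + 3pq(s - 2)]\) 

Hence 

(2.25) \[ \gamma_1 = \alpha_3 = \frac{q - p}{\sqrt{spq}} \] 

(2.26) \[ \alpha_4 = \frac{1}{spq} + \frac{3(s - 2)}{s} \] 

(2.27) \[ \gamma_2 = \alpha_4 - 3 = \frac{1}{spq} - \frac{6}{s} \] 

For the symmetrical binomial curve, for which \(p = q = \tfrac12\), 

(2.28) \[ \gamma_1 = 0, \qquad \gamma_2 = -\frac{2}{s} \] 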
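
The formulas (2.22) to (2.28) may be checked by summing directly over the terms of the binomial. The sketch below is illustrative only: the function name is hypothetical, the case s = 18, p = 2/3 is an arbitrary choice, and exact fractions are used so that the comparisons are equalities rather than approximations.

```python
from fractions import Fraction
from math import comb

def binomial_central_moment(s, p, r):
    """r-th moment about the mean sp, by direct summation over x = 0..s."""
    q = 1 - p
    mean = s * p
    return sum(comb(s, x) * p**x * q**(s - x) * (x - mean)**r
               for x in range(s + 1))

s, p = 18, Fraction(2, 3)
q = 1 - p
mu2 = binomial_central_moment(s, p, 2)
mu3 = binomial_central_moment(s, p, 3)
mu4 = binomial_central_moment(s, p, 4)

alpha4 = mu4 / mu2**2        # kurtosis, as in (2.26)
gamma2 = alpha4 - 3          # excess, as in (2.27)
gamma1_squared = mu3**2 / mu2**3  # square of (2.25), avoiding a square root

print(mu2, mu3, mu4, alpha4, gamma2, gamma1_squared)
```

The square of the skewness is compared rather than the skewness itself, simply to keep the arithmetic rational.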


Equations (2.22) and (2.23) give the mean and variance with respect to 
the number of successes x in s trials. In some statistical investigations the 
data are expressed in terms of percentages or rates. When we may assume 
a constant probability underlying the frequency ratios obtained from observations 
we have a binomial distribution as before but on a different scale. Instead 
of the variable being x it is now x/s. In this case we have 

(2.29) \[ E\left(\frac{x}{s}\right) = \frac{1}{s}E(x) = \frac{sp}{s} = p \] 

For the analogous concept relating to the variance we have 

(2.30) \[ E\left(\frac{x}{s} - p\right)^2 = \frac{1}{s^2}E(x - sp)^2 = \frac{pq}{s} \] 

Therefore, we see from (2.22) and (2.23) that the number of successes per set 
of s trials is distributed about an expected value sp with a standard deviation 
\((spq)^{1/2}\). From (2.29) and (2.30) we see that the proportion of successes in a 
set of s trials is distributed about an expected value p with a standard deviation 
\((pq/s)^{1/2}\). 

It is important to observe that for a fixed value of p the standard deviation 
of x increases as s increases and is proportional to \(s^{1/2}\), whereas the standard 
deviation of x/s decreases as s increases, since it is proportional to \((1/s)^{1/2}\). 


Problems 


1. A dust storm contains particles of two kinds identical except as to color, brown and 
yellow particles existing in the ratio 3:2. If five particles of this dust enter my eye at random, 
determine the probability that two of them are brown and the other three are yellow. 
(See American Mathematical Monthly, 41, 1934, p. 337.) 

2. Six coins are tossed once, or what amounts to the same thing, one coin is tossed six 
times. Find the probability of obtaining heads 

(a) exactly three times 

(b) at most three times 

(c) at least three times 

(d) at least once. 

3. (a) What is the probability of throwing seven in a single toss of two dice? 

(b) In six tosses of two dice find the probability of throwing seven at least once. 

4. Toss six coins 64 times and record the number of times heads appear 0, 1, 2, 3, 4, 5, 6 
times. (Instead of tosses, the coins may be shaken in a box.) Compare the resulting 
distribution of frequencies with the terms of the expansion of \(64(\tfrac12 + \tfrac12)^6\). 

5. A bag contains white and black balls in the proportion 2:3. Let the drawing of a 
white ball be called a success. Three balls are drawn separately and after each drawing 
the ball is returned to the bag and thoroughly mixed with the others so that the funda- 
mental probability of success remains constant during the trials. Find the probabilities of 
0, 1, 2, 3 successes. If this experiment were repeated 125 times what is the theoretical 
frequency of each of the possible number of successes? 


6. (a) Find the values of \(\binom{18}{x}\) for x = 0 to x = 18 inclusive. (To the instructor: 
Pascal’s Triangle provides a simple scheme for constructing a table of binomial coefficients.) 

(b) Evaluate \(2^x/3^{18}\) for x = 0 to x = 18 inclusive. 

(c) Show that \((\tfrac13 + \tfrac23)^{18}\) may be written 

\[ \sum_{x=0}^{18} \binom{18}{x} f(x) \quad \text{where } f(x) = 2^x/3^{18}. \] 

(d) Using the results of (a) and (b), find the values of \(\binom{18}{x} f(x)\) for x = 0 to x = 18. Save 
your results for future reference. 

7. Expand the binomial iV'(i + f)* for a = 2 and s = 8. Find the theoretical fre- 
quencies in each case by taking N as the smallest number necessary to express the terms of 
each expansion as integers. ' 

8. Find the mean and standard deviation for each of the distributions in Problem 7. 

9. Find x, a, m, m for each of the following binomials; 

(i + wAi + m ii + m 


10. For a certain binomial distribution 

tr = 2.66, as == 0.318. Find p, q, and a. 

11. Assume that .04 is the theoretical rate of mortality in a certain age group. Suppose 
an insurance company is carrying s = 1000 such cases. What is the expected dispersion 
(standard deviation) in death rates from the theoretical rate p = .04? What would it be 
if s = 10,000? 



12. The value of x for which p(x, s) is the largest is called the mode of a Bernoulli 
distribution. Show that the mode is the positive integral value (or values) of x for which 

\[ sp - q \le x \le sp + p \] 

Hint. Using equation (2.1), write down the conditions that \(p(x + 1, s)/p(x, s) \le 1\) and 
\(p(x - 1, s)/p(x, s) \le 1\) simultaneously. 

13. Suppose the law of distribution of the happening of an event in s successive trials is 
given by the terms of the expansion of 

\[ (q + p)^s = \sum_{x=0}^{s} \binom{s}{x} p^x q^{s-x} = \sum_{x=0}^{s} P_x \] 

(a) If s = 100 what values of p and q will make \(P_0 = P_1\); \(P_9 = P_{10}\)? 

(b) Give approximate values of the P’s in (a). 

14. A tosses three pennies and B two, and the winner is the one with the greater number 
of heads. In case of a tie they continue until a definite decision is reached. (a) What are, 
at the start, the respective probabilities of winning in a single game (a game is a set of tosses 
leading to a decision)? (b) How much money should A put up on a game to each dollar 
that B puts up, to make the game fair? Ans. (a) 8/11, 3/11; (b) $2.67. 

Note that here our collective, to which the probabilities relate, is an infinite sequence of 
events each one of which consists of a finite or infinite number of tosses. The probability 
that any one game will continue indefinitely is, however, zero. Moreover, in almost every 
long sequence of these games the number of games will be large compared with the largest 
number of tosses in any one game. 

15. A tosses a coin, agreeing beforehand to give B two dollars if it turns up “head,” four 
dollars if “head” does not turn up till the second toss, eight dollars if not until the third toss, 
and in general \(2^{x+1}\) dollars if x “tails” in succession appear before “head” appears. If, 
however, “tail” should appear ten times running, the game will end there with B receiving 
\(2^{11}\) dollars. Assuming that the coin is unbiased and the tossing random, what is B’s 
expectation? Ans. $12.00. 

This is a variant of the famous “Petersburg paradox,” on which a great deal has been 
written. The paradox arises from the fact that, without the agreement to stop at 10 tosses 
in any case, B’s expectation, contrary to common sense, is infinite. He has an extremely 
small chance of winning an enormous amount, an amount which would certainly be beyond 
any A’s capacity to pay. 

We can, however, define a “fair” game, as Feller has shown, in a way which leads to a 
definite solution of the Petersburg problem. 

Let \(B_k\) be B’s entrance fee on the kth trial (a trial being a set of tosses, however long, 
ending with the first head) and \(T_n = \sum_{k=1}^{n} B_k\), the accumulated fees up to the nth trial. Let \(P_n\) 
be the sum of B’s winnings up to the nth trial. Then if for any given ε > 0, the probability 
that \(|P_n - T_n| < \varepsilon T_n\) tends to 1 as n → ∞, the game can be called fair. The value of \(B_k\) 
turns out to be \(\log_2 k\), so that \(T_n = \log_2 n!\) or approximately \((n \log_e n - n)/\log_e 2\) when n 
is large. Thus, for n = 100, \(T_n\) is about $520, corresponding to an average of $5.20 per 
game. 

This definition of a “fair” game depends on what happens in the long run (as n → ∞). 
The assumption is that B is willing to play a very large number of games, large enough to 
make it practically certain that if the game is fair \(|P_n - T_n|\) will be very small compared 
with \(T_n\). 
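
The arithmetic of Problem 15 and of the accumulated fees can be reproduced with a brief sketch. The function names below are hypothetical; the first function assumes the stopping rule stated in the problem (ten tails ends the game), and the last two evaluate the sum of log2 k and the large-n approximation quoted above.

```python
from fractions import Fraction
import math

def truncated_petersburg_expectation(max_tails):
    """Expectation when the game stops after max_tails tails in succession."""
    # x tails then a head: probability 2^-(x+1), payment 2^(x+1) dollars.
    total = sum(Fraction(2**(x + 1), 2**(x + 1)) for x in range(max_tails))
    # max_tails tails running: probability 2^-max_tails, payment 2^(max_tails+1).
    total += Fraction(2**(max_tails + 1), 2**max_tails)
    return total

def accumulated_fees(n):
    """Sum of log2 k for k = 1..n, that is, log2 n!."""
    return sum(math.log2(k) for k in range(1, n + 1))

def approximate_fees(n):
    """The approximation (n log_e n - n) / log_e 2 for large n."""
    return (n * math.log(n) - n) / math.log(2)

print(truncated_petersburg_expectation(10))
print(accumulated_fees(100), approximate_fees(100))
```

Each possible stopping point contributes exactly one dollar to the expectation, which makes plain why the untruncated game has no finite expectation.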

The subject of games has received much attention from mathematicians in recent years, 
partly because of a parallel to certain economic situations (see The Theory of Games and 
Economic Behaviour by J. von Neumann and O. Morgenstern) and partly because of its 
importance in modern warfare. 


2.9 Approximating the Binomial with the Normal Curve. If we plot the 
terms of \((q + p)^s\) as ordinates against the values of \(x/\sqrt{s}\) as abscissas and draw 
the corresponding histogram we find that it approaches a smooth curve as s is 
taken larger and larger. Thus in Figure 5 (where the vertical sides of the 
rectangles are omitted since they contribute nothing to the interpretation) we 
see how the staircase outline of the histogram approaches closer to a continuous 
curve as s is taken larger. 

[Fig. 5. Showing Approach of \((q + p)^s\) to Smooth Curve as s → ∞] 

The limiting values of \(\gamma_1\) and \(\gamma_2\) for the binomial as s → ∞ are those 
of the normal curve. Thus from 

\[ \gamma_1 = \frac{q - p}{\sqrt{spq}} \] 

and 

\[ \gamma_2 = \frac{1}{spq} - \frac{6}{s} \] 

we see that \(\gamma_1 \to 0\) and \(\gamma_2 \to 0\) as s → ∞. This suggests the possibility 
of approximating the binomial with the normal curve. As a matter of fact, it can 
be proved, under certain conditions of approximation, that \((q + p)^s\) approaches 
the normal curve as a limit as s → ∞. A complete proof will not be given here but 
a word or two about it may be appropriate. In using the normal curve to approximate 
the binomial we are particularly interested in a range of three or four 
standard deviations from the mean. This fact suggests the reasonableness of 
assuming that the number of successes x' above or below sp be considered as 
the same order of magnitude as σ. This means that \(x'/(spq)^{1/2}\) will remain 
finite as s → ∞. Now \((spq)^{1/2}\) is of order \(s^{1/2}\) if neither p nor q is extremely 
small. Hence the propriety of assuming (in the proof) that \(x'/s^{1/2}\) 
will remain finite. This is the reason for plotting the histograms (Figure 5) 
in terms of \(x/\sqrt{s}\). 

We may expect, therefore, that the fitted normal curve will give a fair 
approximation to the binomial except possibly at the extremities of the range. 
When the terms of the binomial are arranged symmetrically with respect to 
the mean, that is, when p = q, the approximation is considerably better than 
when either p or q is small compared with the other. 

Because of the central role played by the normal law in statistical theory, 
we append a proof that the limiting form of the binomial curve is the normal 
curve. The proof is, however, suggestive rather than rigorous. 
The variable is first changed to t where 

\[ t = (x - sp)/\sigma, \qquad \sigma = (spq)^{1/2} \] 

so that the “step” of t is 1/σ. This step becomes indefinitely small as s → ∞. 
The slope of the straight line joining the tops of two successive ordinates is 
then equated to the slope of a continuous approximating curve at the midpoint. 
To maintain the area under the curve unaltered, the ordinates are multiplied 
by σ at the same time as the abscissae are divided by σ. The slope of PQ is 

\[ \frac{NQ - MP}{MN} = \frac{f(t + 1/\sigma) - f(t)}{1/\sigma} \] 

[Fig. 6] 

where 

\[ f(t) = \sigma \frac{s!}{x!(s - x)!} p^x q^{s-x}, \qquad x = sp + \sigma t \] 

Hence 

\[ \frac{f(t + 1/\sigma)}{f(t)} = \frac{p(x + 1, s)}{p(x, s)} = \frac{(s - x)p}{(x + 1)q} \] 

and 

\[ \frac{f(t + 1/\sigma) - f(t)}{f(t + 1/\sigma) + f(t)} = \frac{(s - x)p - (x + 1)q}{(s - x)p + (x + 1)q} = \frac{sp - q - x}{sp + q + (q - p)x} = -\frac{q + \sigma t}{q + 2\sigma^2 + (q - p)\sigma t} \] 

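Now the ordinate of the approximating curve midway between t and t + 1/σ 
is nearly \(\tfrac12[f(t + 1/\sigma) + f(t)]\), so that, if y is the ordinate of this approximating 
curve, we can suppose that 

\[ \left(\frac{1}{y}\frac{dy}{dt}\right)_{t + 1/2\sigma} = \lim_{\sigma \to \infty} \left(\frac{\text{slope of } PQ}{\text{mid-ordinate}}\right) = \lim_{\sigma \to \infty} \frac{\sigma[f(t + 1/\sigma) - f(t)]}{\tfrac12\{f(t + 1/\sigma) + f(t)\}} = \lim_{\sigma \to \infty} \frac{-2\sigma(q + \sigma t)}{q + 2\sigma^2 + (q - p)\sigma t} \] 

Substituting t − 1/2σ for t, we have 

\[ \left(\frac{1}{y}\frac{dy}{dt}\right)_{t} = \lim_{\sigma \to \infty} \frac{-2\sigma(q - \tfrac12 + \sigma t)}{q + 2\sigma^2 + (q - p)(\sigma t - \tfrac12)} = \lim_{\sigma \to \infty} \frac{-\left[t + \dfrac{q - p}{2\sigma}\right]}{1 + \dfrac{(q - p)t}{2\sigma} + \dfrac{1}{4\sigma^2}} = -t \] 

so that 

\[ \log y = -\frac{t^2}{2} + c \] 

or 

(2.31) \[ y = A e^{-t^2/2} \] 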

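It may be noted that since \(\gamma_1 = (q - p)/\sigma\), a closer approximation to the 
binomial curve when q is not equal to p is given by 

\[ \frac{1}{y}\frac{dy}{dt} = -\frac{t + \gamma_1/2}{1 + \gamma_1 t/2} = -\frac{2}{\gamma_1} + \frac{2/\gamma_1 - \gamma_1/2}{1 + \gamma_1 t/2} \] 

On integration this gives 

(2.32) \[ y = A e^{-2t/\gamma_1}\left(1 + \frac{\gamma_1 t}{2}\right)^{4/\gamma_1^2 - 1} \] 

which is a skew curve known as Pearson’s Type III (see § 5.6). As s → ∞, 
\(\gamma_1 \to 0\) and this curve tends to the normal curve. 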

In (2.31), since t ranges from −∞ to ∞ as x goes from 0 to ∞, we must 
have \(\int_{-\infty}^{\infty} y\,dt = 1\), whence \(A = (2\pi)^{-1/2}\). A proof of this is given in 
Chapter III (see § 3.2). We have, therefore, the standard equation of the 
normal curve, 

(2.33) \[ \phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \] 
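
The approach of the rescaled binomial ordinates to the normal curve can be observed numerically. The sketch below is an illustration under stated assumptions, not part of the proof: the function names are hypothetical, the ordinates σ p(x, s) are evaluated through `math.lgamma` to avoid overflow, and the largest discrepancy from φ(t) is reported for increasing s.

```python
import math

def phi(t):
    """Ordinate of the standard normal curve."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def scaled_binomial_ordinate(x, s, p):
    """sigma * p(x, s): binomial term after the change of variable to t."""
    q = 1.0 - p
    sigma = math.sqrt(s * p * q)
    log_term = (math.lgamma(s + 1) - math.lgamma(x + 1) - math.lgamma(s - x + 1)
                + x * math.log(p) + (s - x) * math.log(q))
    return sigma * math.exp(log_term)

def max_gap(s, p):
    """Largest difference between sigma * p(x, s) and phi(t) over x = 0..s."""
    sigma = math.sqrt(s * p * (1.0 - p))
    return max(abs(scaled_binomial_ordinate(x, s, p) - phi((x - s * p) / sigma))
               for x in range(s + 1))

for s in (10, 100, 1000):
    print(s, max_gap(s, 0.5), max_gap(s, 0.2))
```

The discrepancy shrinks more slowly for p = 0.2 than for p = 0.5, in keeping with the remark that the approximation is better when p = q.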

Exercise. 

Fit a normal curve to the binomial \((\tfrac13 + \tfrac23)^{18}\). Directions: This binomial may be written 

\[ \sum_{x=0}^{18} \binom{18}{x} \frac{2^x}{3^{18}} \] 

(See Problem 6, § 2.8.) Next recall that the equation of the normal curve is 

\[ y = \frac{N}{\sigma}\phi(t) \] 

where 

\[ \phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \] 

and 

\[ t = \frac{x - \bar{x}}{\sigma} \] 

If we set N = 1, \(\bar{x} = sp\), and \(\sigma = (spq)^{1/2}\) we shall expect that y will give, approximately, the 
values of f(x) for the various values of x. As in Chapter VI of Part I the following outline is 
suggested for organizing the computations. 

x    t    φ(t)    y    f(x) 

Construct the histogram and draw the curve. It is suggested that paper ruled “20 to the 
inch” be used. By comparing the last two columns and also judging from the figure, does 
the fit seem to be good, even though s is rather small and q = p/2? 


Note that judgments by eye as to goodness of fit are likely to be deceptive. 
As will be seen later, there is a better method of forming such a judgment by 
means of the chi-square test. 

The above exercise will help the student appreciate a theorem which will 
now be introduced. The sum of successive terms of the binomial equals the 
area of the corresponding rectangles in its histogram. We may obtain an 
approximation to this sum by finding the area under the fitted normal curve 
which these rectangles occupy. Graphically, the values x = 0, 1, 2, ..., s 
are the mid-points of the bases of these rectangles. Therefore, if we are 
summing the terms of the binomial in which x ranges from \(x = d_1\) to \(x = d_2\) 
inclusive, the corresponding area under the curve will be from \(x = d_1 - \tfrac12\) to 
\(x = d_2 + \tfrac12\). We must convert these values into standard units in order to 
enter a table of areas of the normal curve. Hence we have the following 
theorem. 


2.10 The DeMoivre-Laplace Theorem. 

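Theorem 2.1. The sum of those terms of the binomial \((q + p)^s\) in which the 
number of successes x ranges from \(d_1\) to \(d_2\), inclusive, is approximately 

(2.34) \[ Q = \int_{t_1}^{t_2} \phi(t)\,dt \] 

where 

\[ t_1 = \frac{d_1 - \tfrac12 - sp}{\sigma}, \qquad t_2 = \frac{d_2 + \tfrac12 - sp}{\sigma}, \qquad \sigma = (spq)^{1/2} \] 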


A careful and ingenious, but rather long, proof of the DeMoivre-Laplace 
Theorem is given in Uspensky’s book. The great merit of this proof is that it 
provides an upper limit to the error involved in making the approximation, 
at least for reasonably large values of s. The result may be stated as follows 
and is exact: 


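Theorem 2.2. 

(2.35) \[ Q = \int_{t_1}^{t_2} \phi(t)\,dt + \frac{q - p}{6\sigma}\Big[(1 - t^2)\phi(t)\Big]_{t_1}^{t_2} + \Omega \] 

where 

(2.36) \[ |\Omega| < \frac{0.13 + 0.18\,|p - q|}{\sigma^2} + e^{-3\sigma/2} \] 

If \(t_1 = -t_2\), this reduces to 

(2.37) \[ Q = \int_{-t_2}^{t_2} \phi(t)\,dt + \Omega = 2\Phi(t_2) - 1 + \Omega \] 

where 

\[ \Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt \] 

for σ ≥ 5. 

Values of Φ(x) are given in the Appendix, Table 1. (The table gives 
Φ(x) − 0.5, since the integral is taken from 0 to x. See also § 2.13.) 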

Example 1. Suppose p = .2 is the probability of success in a single trial. Estimate the 
probability of obtaining less than five or more than fifteen successes in fifty trials. 

Solution. The required probability, indicated by the shaded area in Figure 7, is 
P = 1 − Q where Q is the probability of obtaining more than 4 and less than 16 successes. 
In using Theorem 2.1, we have 

\[ sp = 10, \quad \sigma = 2.828, \quad t_1 = -1.944, \quad t_2 = 1.944 \] 

Therefore, 

\[ P = 1 - 2\int_0^{1.944} \phi(t)\,dt = .0519 \] 

The exact probability is obtained by evaluating and adding the sixth to the sixteenth 
terms of \((.8 + .2)^{50}\) and subtracting the result from unity. However, instead of computing 
these terms separately, a systematic procedure may be set up by which each term is made to 
depend upon the preceding term. Thus we may write a binomial as follows: 

\[ (q + p)^s = q^s(1 + k)^s = q^s\left[1 + sk + \frac{s(s - 1)}{2!}k^2 + \cdots\right] \] 

where k = p/q. Then \(q^s\) may be computed by logarithms and its product with the terms in 
the brackets may be obtained on computing machines by a continuous process. Thus for 
the terms within the brackets, 

the second term is first term multiplied by \(sk\) 
the third term is second term multiplied by \(\frac{s - 1}{2}k\) 
the fourth term is third term multiplied by \(\frac{s - 2}{3}k\) 
the rth term is (r − 1)st term multiplied by \(\frac{s - r + 2}{r - 1}k\) 

In this way we find Q = .9497, so the required probability is P = .0503. For most practical 
purposes the approximation by use of Theorem 2.1 would be satisfactory. 

Uspensky’s limit of error is here somewhat uncertain, since σ is less than 5. 
Inequality (2.36) gives a value of 0.04 as an upper limit for |Ω|. In this case, 
the approximation is actually much better than the formula would suggest. 
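
The two computations of Example 1 can be sketched as follows. This is an illustration with hypothetical function names: the binomial terms are generated by the term-to-term multiplier described in the example, and the normal area of Theorem 2.1 is obtained from `math.erf` in place of a printed table. A machine evaluation may differ in the last figures from values worked by hand.

```python
import math

def binomial_terms(s, p):
    """Terms of (q + p)^s, each obtained from the preceding one."""
    q = 1.0 - p
    k = p / q
    terms = [q ** s]
    for x in range(1, s + 1):
        # The term for x successes is the preceding term times (s - x + 1)/x * k.
        terms.append(terms[-1] * (s - x + 1) / x * k)
    return terms

def Phi(x):
    """Area under the standard normal curve from minus infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_estimate(d1, d2, s, p):
    """Approximate sum of terms for x = d1..d2 by Theorem 2.1."""
    sigma = math.sqrt(s * p * (1.0 - p))
    t1 = (d1 - 0.5 - s * p) / sigma
    t2 = (d2 + 0.5 - s * p) / sigma
    return Phi(t2) - Phi(t1)

s, p = 50, 0.2
terms = binomial_terms(s, p)
Q_exact = sum(terms[5:16])            # x = 5, 6, ..., 15
Q_normal = normal_estimate(5, 15, s, p)
print(1 - Q_exact, 1 - Q_normal)
```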

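Example 2. Find the probability that in throwing 100 coins one will obtain a number of 
heads which will differ from the expected number by less than five. 

Solution. Here s = 100 and p = q = 1/2, so that sp = 50 and σ = 5. The number of heads 
must lie between 46 and 54 inclusive, whence 

\[ t_2 = \frac{54 + \tfrac12 - 50}{5} = 0.9 = -t_1 \] 

Hence the required probability is given by 

\[ Q \approx 2\int_0^{0.9} \phi(t)\,dt = .632 \] 

Here |Ω| < .0054, so that we can be quite certain that the probability lies between 0.626 
and 0.637. 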

2.11 Simple Sampling of Attributes. It is a matter of common experience 
that certain fluctuations between observation and expectation under a given 
hypothesis may be explained on the basis of chance. For example, in throwing 
100 coins an observed result of 45 heads and 55 tails does not warrant the 
conclusion that the coins are biased. In such cases a very natural question 
arises as to what sampling deviations may be allowed before we conclude that 
they indicate the operation of definite and assignable causes, that is, that the 
results are inconsistent with the given hypothesis. The theory dealing with 
such fluctuations in relative frequencies is called sampling of attributes. 

Suppose we are given a sample of s individuals of which x have a certain
character or attribute. The question then arises: Is this result consistent
with the hypothesis that the sample is drawn from a population having the
fraction p with the given character? Could it reasonably have arisen on the
basis of chance or is it significant of other than chance factors? In answering
this question our common-sense judgment is greatly aided by a probability
scale for chance fluctuations under the given hypothesis. We therefore restate
our question more precisely as follows:

Suppose the probability of an event is known from theoretical considerations
to be equal to p. What is the probability that in s trials the number of
successes will differ numerically from the expected number sp by as much as
(or more than) an observed amount d?

The required probability may be estimated by means of the following
corollaries to the DeMoivre-Laplace Theorem.

Theorem 2.3. The probability that the number of successes x in s trials will
differ from the expected number sp by more than d is approximately given by
$P_\delta = 1 - Q_\delta$ where

$$Q_\delta = 2\int_0^{\delta}\phi(t)\,dt \quad\text{and}\quad \delta = \frac{d + \frac12}{\sigma}$$

Theorem 2.4. If the words "more than" in Theorem 2.3 be replaced by "as
much as" then $\delta = (d - \frac12)/\sigma$.

The proofs are obvious if we admit that the normal curve fits the histogram 
of the point binomial. 



In another slightly different form involving relative frequencies, $Q_\delta$ gives an
approximation to the probability that the difference between an observed
relative frequency of success x/s and the true probability p satisfies the relation

$$\left|\frac{x}{s} - p\right| < \delta\sqrt{\frac{pq}{s}} \tag{2.39}$$

for every assigned positive value of $\delta$.

In using Theorem 2.3, Table 1 gives a general idea of the magnitudes of 
probabilities for certain deviations. It is divided into two sections: the first 
section lists probabilities for specially selected deviations, the second section 
lists deviations for specially selected probabilities. 


Table 1. Abridged Normal Probability Scale

Deviation     Chance of Deviation       Deviation     Chance of Deviation
$\delta$      Outside $\pm\delta$       $\delta$      Outside $\pm\delta$

  0.5             .617                    .67             .50
  1.0             .317                   1.28             .20
  1.5             .134                   1.64             .10
  2.0             .046                   1.96             .05
  2.5             .0124                  2.33             .02
  3.0             .0027                  2.58             .01
  3.5             .00047                 3.29             .001

A computed probability is used to scale our judgment as to whether the
deviation in question can be explained on the basis of chance. If it cannot
be so explained, it is said to be "significant" of other than chance causes.
In passing judgment on a deviation it is sometimes difficult to give a definite
answer. Good judgment in these matters is sharpened by experience in the
particular field. However, it is customary in much experimental work to
describe an effect as "significant" if the probability of its arising by chance,
on the hypothesis that its true value in the parent population is zero, is less
than 0.05, and as "highly significant" if this probability is less than 0.01.
This is the convention adopted by R. A. Fisher and is the one we shall follow
as a rule. The level of significance to be used in any particular case will
frequently depend, however, on the seriousness of drawing the wrong conclusions.
If human lives are dependent on the reality of an effect, it is unlikely
that one would be satisfied with a probability of 1 in 20 of being wrong in
claiming that the effect exists. This point will be discussed later in connection
with sampling theory and statistical inference.
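The probability scale of Table 1, including the deviations that correspond to the conventional levels 0.05 and 0.01, can be recomputed with a few lines of Python. This is a sketch only: the function names are ours, and the standard library's `math.erfc` and `statistics.NormalDist` take the place of printed tables.

```python
import math
from statistics import NormalDist

def chance_outside(delta):
    """Two-sided normal tail area outside +/- delta, that is 1 - Q_delta."""
    return math.erfc(delta / math.sqrt(2.0))

def deviation_for(chance):
    """Deviation delta whose two-sided tail area equals `chance`."""
    return NormalDist().inv_cdf(1.0 - chance / 2.0)

# First section of the table: chances for selected deviations.
for delta in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5):
    print(f"{delta:4.1f}   {chance_outside(delta):.5f}")

# Second section: deviations for selected chances.
for chance in (0.50, 0.20, 0.10, 0.05, 0.02, 0.01, 0.001):
    print(f"{deviation_for(chance):5.2f}   {chance}")
```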

Example 3. (Rietz) A group of scientific men reported 1705 sons and 1527 daughters.
The examination of these numbers brings up the following fundamental questions of simple
sampling. Do these data conform to the hypothesis that 1/2 is the probability that a child to
be born will be a boy? That is, can the deviations be reasonably regarded as fluctuations in
simple sampling under this hypothesis? In another form, what is the probability in throwing
3232 coins that the number of heads will differ from (3232/2) = 1616 by as much as
d = 1705 - 1616 = 89?

Solution. $s = 3232$, $(pqs)^{1/2} = 28.425$, $\delta = 88.5/28.425 = 3.113$,

$$P_\delta = 1 - 2\int_0^{3.113}\phi(t)\,dt = 1 - .9981 = .0019$$

Hence we conclude that these data cannot be explained on the basis of chance, that is, they
are inconsistent with a hypothetical sex ratio of 1/2.
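The calculation of Example 3 may be checked numerically. The sketch below applies the half-unit correction of Theorems 2.3 and 2.4; the function name and the `inclusive` flag are our own labels, not notation from the text.

```python
import math

def tail_probability(s, p, d, inclusive):
    """Return (delta, P_delta) for a deviation from sp of 'more than d'
    (Theorem 2.3, inclusive=False) or 'as much as d' (Theorem 2.4, inclusive=True)."""
    sigma = math.sqrt(s * p * (1.0 - p))
    delta = (d - 0.5) / sigma if inclusive else (d + 0.5) / sigma
    # P_delta = 1 - 2 * integral_0^delta phi(t) dt = erfc(delta / sqrt(2))
    return delta, math.erfc(delta / math.sqrt(2.0))

delta, P = tail_probability(3232, 0.5, 1705 - 1616, inclusive=True)
print(f"delta = {delta:.3f}, P = {P:.4f}")
```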


In the statement of the DeMoivre-Laplace Theorem, $d_1$ and $d_2$ were integers.
In Theorem 2.3 d need not be an integer, since sp is not necessarily
integral.

The more exact statement of Theorem 2.3, due to Uspensky, is as follows:

Theorem 2.5. The probability of the inequality $|x - sp| \le d$ is given by

$$Q_{\delta_1} = 2\int_0^{\delta_1}\phi(t)\,dt + \frac{1 - \theta_1 - \theta_2}{\sigma}\,\phi(\delta_1) + \Omega_1 \tag{2.40}$$

where

$$\delta_1 = d/\sigma, \qquad |\Omega_1| < \frac{0.20 + 0.25\,|p - q|}{\sigma^2} + e^{-3\sigma/2} \tag{2.41}$$

for $\sigma \ge 5$, and $\theta_1$ and $\theta_2$ are the fractional parts respectively of $sq + d$ and
$sp + d$.

That is, if $d_1$ is the integer equal to or next above $sp - d$, and $d_2$ is the integer
equal to or next below $sp + d$,

$$sp - d = s - (sq + d) = d_1 - \theta_1 \tag{2.42}$$

$$sp + d = d_2 + \theta_2 \tag{2.43}$$

The sign of equality in Theorem 2.5 can occur only if $sp + d$ or $sq + d$ is an
integer.

If in Theorem 2.3 we put $\delta = \delta_1 + 1/(2\sigma)$, and if s is large enough for $1/(2\sigma)$ to
be small, compared with $\phi(\delta_1)$, we obtain

$$Q_\delta = 2\int_0^{\delta_1}\phi(t)\,dt + \frac{1}{\sigma}\,\phi(\delta_1) \tag{2.44}$$

approximately, which indicates that $Q_{\delta_1}$ differs from $Q_\delta$ only by terms depending
on $\theta_1/\sigma$ and $\theta_2/\sigma$ and by the error term $\Omega_1$.

Note that if s tends to infinity while $\delta_1$ remains fixed, the probability that
$|x - sp| \le \delta_1\sigma$ tends to the limit

$$2\int_0^{\delta_1}\phi(t)\,dt = 2\Phi(\delta_1) - 1$$

where $\Phi(t)$ is the distribution function of the normal law.

This is a very special case of an extremely general theorem, known as the
Central Limit Theorem, of which something more will be said in Chapter IV.

2.12 Bernoulli’s Theorem. If an event occurs x times in s trials and if p is 
the probability of success in a single trial, the probability that the relative 
frequency of success x/s will differ from p by less than any given arbitrary 
positive quantity $\epsilon$ tends to 1 as s increases indefinitely.

On the frequency definition of probability this theorem is a mere tautology. 
If, however, we suppose that probabilities can be assigned without performing 
an indefinitely long sequence of trials, the theorem provides a link between 
the probabilities so assigned and the results of often-repeated trials. 

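If $\epsilon$ is given and $|x/s - p| \le \epsilon$, x may lie between $sp - s\epsilon$ and $sp + s\epsilon$,
inclusive.

Taking $\delta_1 = s\epsilon/\sigma$, the probability that $|x/s - p| \le \epsilon$ is, from (2.40),
given by

$$Q_{\delta_1} = 2\int_0^{\delta_1}\phi(t)\,dt + \frac{1 - \theta_1 - \theta_2}{\sigma}\,\phi(\delta_1) + \Omega_1$$

and as $s \to \infty$

$$Q_{\delta_1} \to 2\int_0^{\infty}\phi(t)\,dt = 1$$

This proves the theorem.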


Example 4. If $p = \frac35$ and $s = 6520$, what is the probability that the relative frequency of
success will not differ from $\frac35$ by more than 0.01?

Here $\sigma = (spq)^{1/2} = 39.56$, $\epsilon = 0.01$, $sp - s\epsilon = 3912 - 65.2 = 3846.8$, $sp + s\epsilon = 3977.2$.

Hence $\theta_1 = \theta_2 = 0.2$, $\delta_1 = 65.2/39.56 = 1.648$, and the required probability
$= 2\Phi(1.648) + (0.6/39.56)\phi(1.648) - 1 + \Omega_1 = 0.90220 + \Omega_1$. Here $|\Omega_1| < (0.20 + 0.05)/1565 = .00016$,
so that the calculated probability is certainly not in error by more than two units in the
fourth decimal place.
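The arithmetic of Example 4 can be reproduced with a short Python sketch of formula (2.40) and the error bound (2.41). The function name `uspensky_q` is ours; exact fractions are used for p and d so that the fractional parts $\theta_1$ and $\theta_2$ are not disturbed by rounding, which is a choice of this illustration rather than anything prescribed by the text.

```python
import math
from fractions import Fraction

def phi(t):
    """Normal frequency function."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def uspensky_q(s, p, d):
    """Main part of Q_delta1 in (2.40) and the bound on |Omega_1| in (2.41)."""
    p, d = Fraction(p), Fraction(d)
    q = 1 - p
    sigma = math.sqrt(s * p * q)
    delta1 = float(d) / sigma
    theta1 = float((s * q + d) % 1)       # fractional part of sq + d
    theta2 = float((s * p + d) % 1)       # fractional part of sp + d
    main = 2.0 * Phi(delta1) - 1.0 + (1.0 - theta1 - theta2) / sigma * phi(delta1)
    bound = (0.20 + 0.25 * abs(float(p - q))) / sigma ** 2 + math.exp(-1.5 * sigma)
    return main, bound

# Example 4: p = 3/5, s = 6520, d = s * epsilon = 65.2
main, bound = uspensky_q(6520, Fraction(3, 5), Fraction("65.2"))
print(f"Q = {main:.5f} + Omega_1, |Omega_1| < {bound:.5f}")
```

The bound is only claimed for $\sigma \ge 5$, so the second returned value should be disregarded for smaller samples.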

Example 5 (Wolf's Dice Experiments). From time to time, attempts have been made to
verify the calculations of probability theory by making actual experimental trials with dice,
balls, or other apparatus. Among the most extensive of such trials is one made in 1850 by
the Swiss astronomer Wolf,⁶ in the course of which two dice were thrown 100,000 times.

One result was that the frequency of unlike pairs was 83,533 as compared with a theoretical
value of $\frac56 \times 100{,}000$. The probability of a discrepancy not greater than that actually
obtained is $Q_{\delta_1}$ where

$$sp = 83{,}333\tfrac13, \quad \theta_1 = \tfrac13, \quad \theta_2 = 0, \quad \delta_1 = \frac{199\frac23}{117.85} = 1.6943$$

so that

$$Q_{\delta_1} = 2\Phi(1.6943) - 1 + \frac{2}{3(117.85)}\,\phi(1.6943) + \Omega_1 = 0.90962 + 0.00054 + \Omega_1$$

with $|\Omega_1| < 0.000026$. The required probability is, therefore, 0.9102, almost exactly.
Since this differs from 1 by about 0.09, we can hardly say that the result was surprising. A
larger discrepancy would have occurred about once in 11 times with perfect dice. Some
others of Wolf's results, however, indicate that his dice were, in fact, decidedly imperfect.
All such experimental trials merely serve to demonstrate the accuracy or lack of accuracy of
the apparatus used. They can have no bearing on the validity of the axioms of probability
theory.


2.13 Tables of the Normal Law. The normal law, as we shall see later, 
occupies a central position in the theory of probability and mathematical 
statistics. It is the limiting form assumed, not merely by the binomial law, 
but by a very large class of frequency functions as the size of the sample
increases, and moreover the assumption of a normal distribution of errors of 
observation is fundamental in many widely used techniques of statistical in- 
ference. The normal law (or error law, or probability integral) has, therefore, 
been extensively tabulated, but there is considerable diversity of arrangement 
among the tables. 

The distribution function for the normal law is

$$\Phi(t) = \int_{-\infty}^{t}\phi(t)\,dt$$

where

$$\phi(t) = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}$$

Table I in the Appendix to this book lists $\phi(t)$ and $\int_0^t\phi(t)\,dt$, so that $\Phi(t)$ is
given by adding 0.5 to the tabular values. Part III of Glover's Tables of
Applied Mathematics in Finance, Insurance, Statistics also gives $\phi(t)$ and
$\int_0^t\phi(t)\,dt$, as well as the derivatives of $\phi(t)$ from the 2nd to the 8th.

Some books, including Peirce's Short Table of Integrals, tabulate

$$g(x) = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}\,dt$$

It is readily seen by changing the variable of integration to $u = \sqrt2\,t$ that

$$g(x) = 2\Phi(x\sqrt2) - 1$$
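The relation between the two tabulations is easy to confirm numerically. In the sketch below, $g(x)$ is taken to be the function that Python's standard library calls `math.erf`; the helper `Phi` is our own name for the normal distribution function.

```python
import math

def Phi(x):
    """Normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def g(x):
    """g(x) = (2 / sqrt(pi)) * integral_0^x exp(-t^2) dt."""
    return math.erf(x)

for x in (0.25, 0.5, 1.0, 2.0):
    left = g(x)
    right = 2.0 * Phi(x * math.sqrt(2.0)) - 1.0
    print(f"x = {x}: g(x) = {left:.10f}, 2*Phi(x*sqrt(2)) - 1 = {right:.10f}")
```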

Again, Pearson's Tables for Statisticians and Biometricians, Part I, Table
II, denotes $\phi(x)$ by z and $\int_{-x}^{x}\phi(t)\,dt$ by $\alpha$. The quantity tabulated is
$\frac12(1 + \alpha) = \Phi(x)$.

In Fisher's Statistical Methods for Research Workers, 10th Edition, page 77,
the table gives values of x for selected values of P, where

$$P = 1 - \int_{-x}^{x}\phi(t)\,dt = 1 - 2\int_0^x\phi(t)\,dt = 2[1 - \Phi(x)].$$

This arrangement is convenient for many purposes, but if it is required to
determine areas under a normal curve the table has to be read "inside out."

Tables of the normal law are sometimes disguised as probits, e.g., in Statistical
Tables by Fisher and Yates, Table IX. If A is the percentage area up to
the ordinate at x, $A = 100\Phi(x)$, and the probit corresponding to A is $5 + x$.
The 5 is added to avoid negative values in all cases that are likely to occur in
practice. Thus for $A = 10$, $\Phi(x) = 0.1$, $x = -1.282$, and the probit is
$5 - 1.282 = 3.718$. Kelley's Statistical Tables (Harvard Univ. Press, 1948)
give x and $\phi(x)$ to 8 decimal places corresponding to values of $\Phi(x)$ at every
0.0001 between 0.5 and 1. In these tables $\Phi(x)$ is denoted by p and $\phi(x)$ by z.
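The probit transformation described above amounts to reading the normal table "inside out" and adding 5. A minimal sketch, assuming the standard library's `statistics.NormalDist` as a stand-in for the printed table (the function name `probit` is ours):

```python
from statistics import NormalDist

def probit(A):
    """Probit for a percentage area A: solve A = 100 * Phi(x), return 5 + x."""
    x = NormalDist().inv_cdf(A / 100.0)
    return 5.0 + x

for A in (1, 10, 50, 90, 99):
    print(f"A = {A:3d}   probit = {probit(A):.3f}")
```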

The most extensive and accurate tables of the normal law now available
(Tables of Probability Functions, Vol. II) were prepared under the Federal
Government Work Projects Administration and published by the National
Bureau of Standards, 1942. These give $\phi(x)$ and $2\Phi(x) - 1$ to 15 decimal
places at intervals of 0.0001 between 0 and 1, at intervals of 0.001 between
1 and 7.8, and at larger intervals up to x = 8. An auxiliary table continues
as far as x = 10 to 7 significant figures. Vol. I of these tables gives values of
comparable accuracy for the error function $(2/\sqrt\pi)\,e^{-x^2}$ and its integral.

2.14 The Poisson Exponential Approximation. If p (or q) is small the
normal curve cannot ordinarily be used with confidence to approximate the
terms of the binomial $(q + p)^s$. If s is large but sp is of moderate size (of
the order of 10), a useful approximation to

$$p(x, s) = \frac{s!}{x!\,(s - x)!}\,p^x q^{s-x} \tag{2.45}$$

may be given by means of the Poisson exponential function. Statistical
examples of this situation are sometimes called rare events and occur in widely
different fields; for example, the number born blind per year in a large city,
the number of organisms of a given size S on a given glass slide that escape
death by X-rays after being exposed for t seconds, the number of times in a
certain year that the volume of trading on the New York Stock Exchange
exceeds M million shares, the frequency of certain "peaks" in a given time
interval such as occur in telephone traffic, and other problems in demands
for services.

The word "rare" means "individually rare." In a large population several
such events may occur, but the probability of occurrence of each individual 
event is small. 

The approximation is sometimes used with values of p as high as 0.05, 
but s should not then be larger than, say, 200, and even so the approximation 
may be 3 or 4 per cent in error. 

Suppose, then, that p is the probability for the occurrence of the rare event
in question, and assume that $q = 1 - p$ is nearly unity. From (2.45)

$$p(x, s) = \frac{s(s-1)\cdots(s-x+1)}{x!}\,p^x(1-p)^{s-x}
= \frac{s^x}{x!}\left(1 - \frac1s\right)\left(1 - \frac2s\right)\cdots\left(1 - \frac{x-1}{s}\right)\left(\frac ms\right)^x\left(1 - \frac ms\right)^{s-x}$$

where $m = sp$. Hence

$$p(x, s) = \frac{m^x}{x!}\left(1 - \frac1s\right)\left(1 - \frac2s\right)\cdots\left(1 - \frac{x-1}{s}\right)\left(1 - \frac ms\right)^s\left(1 - \frac ms\right)^{-x} \tag{2.46}$$

Now if x is fixed and $s \to \infty$ while $p \to 0$ in such a way that m remains finite,
the factors $1 - 1/s, \cdots, 1 - (x-1)/s$, $(1 - m/s)^{-x}$ all $\to 1$. Also, it is well
known that

$$\lim_{s\to\infty}(1 - m/s)^s = e^{-m}$$

so

$$\lim_{s\to\infty} p(x, s) = \frac{m^x e^{-m}}{x!} \tag{2.47}$$

which is Poisson's exponential function. For rare events it gives an approximation
to the true probability of x successes in s trials. Tables of this function
are to be found in Pearson's Tables for Statisticians and Biometricians,
Part I, Table LI, and in Fry's Probability and its Engineering Uses. More
extensive tables have been compiled by Molina.⁷

Uspensky has given a formula which indicates the accuracy of the approximation,
namely,

$$p(x, s) = \frac{m^x e^{-m}}{x!}\,\beta \tag{2.48}$$

where

$$\beta = [1 - \theta g(x)]\,e^{h(x)}, \qquad 0 < \theta < 1$$

$$g(x) = \frac{(s - x)\,m^3}{3(s - m)^3} + \frac{x^3}{2s(s - x)}$$

and

Thus, for $p = 1/100$, $s = 1000$, $m = 10$, the Poisson approximation for
$x = 10$ is $10^{10}e^{-10}/10! = 0.1251$. Here $g(10) = 0.000845$, $h(10) = 0.0055$, so
that $\beta = 1.005515(1 - 0.000845\,\theta)$, that is, $\beta$ lies between 1.004665 and
1.005515, or $p(10, 1000)$ lies between 0.12568 and 0.12579. The exact value,
obtained by using logarithms of factorials, is 0.12574, which is about halfway
between the limits assigned.
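The figures of this illustration can be checked directly. The sketch below evaluates the Poisson term (2.47) and the exact binomial term (2.45) for $p = 1/100$, $s = 1000$, $x = 10$; the function names are ours, and exact integer binomial coefficients replace the logarithms of factorials mentioned in the text.

```python
import math

def poisson_term(x, m):
    """Poisson exponential function m^x e^(-m) / x!."""
    return m ** x * math.exp(-m) / math.factorial(x)

def binomial_term(x, s, p):
    """Exact binomial probability of x successes in s trials."""
    return math.comb(s, x) * p ** x * (1.0 - p) ** (s - x)

s, p, x = 1000, 0.01, 10
m = s * p
approx = poisson_term(x, m)
exact = binomial_term(x, s, p)
ratio = exact / approx                    # plays the part of beta in (2.48)
print(f"Poisson {approx:.5f}, binomial {exact:.5f}, ratio {ratio:.6f}")
```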

The terms of the series

$$e^{-m}\left[1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \cdots + \frac{m^x}{x!} + \cdots\right] \tag{2.49}$$

give the approximate probabilities of exactly $0, 1, 2, \cdots, x, \cdots$ occurrences of
the rare event in s trials. Hence the probability that the number of successes
is at least equal to x is approximately given by

$$Q_m(x) = \sum_{r=x}^{\infty}\frac{m^r e^{-m}}{r!} \tag{2.50}$$

This function is tabulated by Fry and by Molina.

According to Uspensky, the true probability of at least x successes in s trials
is given by

$$P_{m,s}(x) = Q_m(x) + \Delta \tag{2.51}$$

where

$$|\Delta| < (e^{\chi} - 1)\,Q_m(x + 1), \quad\text{if } Q_m(x + 1) > \tfrac12$$

$$|\Delta| < (e^{\chi} - 1)\,[1 - Q_m(x + 1)], \quad\text{if } Q_m(x + 1) < \tfrac12$$

and

$$\chi = \frac{m + \frac14 + \dfrac{m^3}{s}}{2(s - m)}$$

Thus, with the data above,

$$Q_{10}(10) = 0.54207, \qquad \chi = 0.00568$$

$$|\Delta| < 0.005696\,(1 - 0.41696) = 0.00332$$

so that $P_{10,1000}(10)$ lies between 0.5388 and 0.5454. The true value is very
close to 0.5421.
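The tail probability and its error limit for these data can be recomputed as follows. This is a sketch under one stated assumption: $\chi$ is taken as $(m + \frac14 + m^3/s)/2(s - m)$, which reproduces the value 0.00568 quoted for $m = 10$, $s = 1000$. The function names are ours.

```python
import math

def Q(m, x):
    """Poisson probability of at least x occurrences, as in (2.50)."""
    below = sum(m ** r * math.exp(-m) / math.factorial(r) for r in range(x))
    return 1.0 - below

def delta_bound(m, s, x):
    """Return (chi, upper limit for |Delta|) in (2.51)."""
    chi = (m + 0.25 + m ** 3 / s) / (2.0 * (s - m))   # assumed form of chi
    q_next = Q(m, x + 1)
    factor = q_next if q_next >= 0.5 else 1.0 - q_next
    return chi, (math.exp(chi) - 1.0) * factor

q10 = Q(10, 10)
chi, limit = delta_bound(10, 1000, 10)
print(f"Q = {q10:.5f}, chi = {chi:.5f}, |Delta| < {limit:.5f}")
print(f"P lies between {q10 - limit:.4f} and {q10 + limit:.4f}")
```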

Certain simple and interesting results may be obtained for the moments
of the distribution given by (2.47) when x takes all integral values from x = 0
to x = s. First we observe that from (2.49)

$$\sum_{x=0}^{s}\frac{m^x e^{-m}}{x!} = 1$$

approximately if s is large.

Then

$$E(x) = \nu_1 = \sum_{0}^{s} x f(x) = me^{-m}\left[1 + m + \frac{m^2}{2!} + \cdots + \frac{m^{s-1}}{(s-1)!}\right] = m = sp, \text{ approximately} \tag{2.52}$$

And

$$\nu_2 = \sum_{0}^{s} x^2 f(x) = \sum_{0}^{s}[x(x - 1) + x]f(x) = m(m + 1), \text{ approximately}$$

The theoretical Poisson approximation is the limiting distribution when
$s \to \infty$. For this limiting distribution the above values of $\nu_1$ and $\nu_2$ are exact.
From these results, we have

$$\text{Mean} = m = sp$$

$$\mu_2 = \nu_2 - \nu_1^2 = m(m + 1) - m^2 = m \tag{2.53}$$

$$\sigma = (m)^{1/2} \tag{2.54}$$

It may also be shown that

$$\nu_3 = m(m^2 + 3m + 1)$$

$$\nu_4 = m(m^3 + 6m^2 + 7m + 1)$$

whence we find that

$$\mu_3 = m \tag{2.55}$$

$$\mu_4 = 3m^2 + m \tag{2.56}$$

and

$$\gamma_1 = \frac{1}{\sqrt m}, \qquad \gamma_2 = \frac1m \tag{2.57}$$

It is a rather striking result that each of the mean, variance, and $\mu_3$ is equal
to m.

The importance of the Poisson approximation in dealing with certain
problems in telephone engineering and other fields is discussed in Fry's book,
Probability and Its Engineering Uses. The interested student might investigate
and prepare a special report on some of these applications.

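A recursion formula for the moments of the Poisson distribution is readily
obtained. Thus

$$\mu_r = \sum_{x=0}^{\infty}(x - m)^r\,\frac{m^x e^{-m}}{x!}$$

Hence

$$\frac{d\mu_r}{dm} = \sum_{x=0}^{\infty}\left[-r(x - m)^{r-1} + (x - m)^r\left(\frac xm - 1\right)\right]\frac{m^x e^{-m}}{x!} = -r\mu_{r-1} + \frac{\mu_{r+1}}{m}$$

or

$$\mu_{r+1} = mr\mu_{r-1} + m\,\frac{d\mu_r}{dm} \tag{2.58}$$

Putting $\mu_0 = 1$, $\mu_1 = 0$, we get successively,

$$\mu_2 = m$$

$$\mu_3 = m$$

$$\mu_4 = 3m^2 + m$$

$$\mu_5 = 4m^2 + 6m^2 + m = 10m^2 + m, \text{ etc.}$$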
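The recursion (2.58) lends itself to mechanical computation. In the sketch below each moment is held as a polynomial in m, stored as a list of coefficients `[c0, c1, c2, ...]`; this representation and the function name are our own devices for illustration.

```python
def poisson_central_moments(r_max):
    """Central moments mu_0, ..., mu_{r_max} of the Poisson law as
    polynomials in m, using mu_{r+1} = m*r*mu_{r-1} + m*d(mu_r)/dm."""
    def times_m(poly):
        return [0] + poly                         # multiply by m
    def derivative(poly):
        return [i * c for i, c in enumerate(poly)][1:] or [0]
    def add(a, b):
        n = max(len(a), len(b))
        a = a + [0] * (n - len(a))
        b = b + [0] * (n - len(b))
        return [u + v for u, v in zip(a, b)]
    def trim(poly):
        while len(poly) > 1 and poly[-1] == 0:
            poly = poly[:-1]
        return poly

    mu = [[1], [0]]                               # mu_0 = 1, mu_1 = 0
    for r in range(1, r_max):
        first = [r * c for c in times_m(mu[r - 1])]    # m r mu_{r-1}
        second = times_m(derivative(mu[r]))            # m d(mu_r)/dm
        mu.append(trim(add(first, second)))
    return mu

mu = poisson_central_moments(6)
for r, poly in enumerate(mu):
    print(f"mu_{r}: coefficients of 1, m, m^2, ... = {poly}")
```

For instance, the list `[0, 1, 3]` stands for $m + 3m^2$.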

2.15 Poisson and Lexis Schemes. In the Bernoulli scheme (§ 2.1), we con- 
sidered s trials of an event with a constant probability p of success in a single 
trial. Variants of this scheme are associated with the names of Poisson and 
Lexis. 

In the Poisson scheme, the probability of success $p_i$, at the ith trial, varies
from trial to trial. If x is the number of successes,

$$E(x) = \sum_{i=1}^{s}E(z_i) = \sum_i p_i \tag{2.59}$$

where $z_i$ takes the values 1 and 0 with probabilities $p_i$ and $q_i$ respectively.
Also, since the trials are independent, the variance of x is

$$\text{Var}(x) = \sum_i \text{Var}(z_i) = \sum_i p_i q_i = \sum_i p_i - \sum_i p_i^2 \tag{2.60}$$

Let p = mean of the $p_i$ = $(1/s)\sum_i p_i$, and let

$$\sigma_p^2 = \text{variance of the } p_i = \frac1s\sum_i (p_i - p)^2 = \frac1s\sum_i p_i^2 - p^2$$

Then

$$\text{Var}(x) = sp - s(\sigma_p^2 + p^2) = spq - s\sigma_p^2 \tag{2.61}$$

and hence is less than if all the $p_i$ are equal to p.
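Formula (2.61) can be verified on a small numerical case. The probabilities below are hypothetical values chosen only for illustration, and exact fractions are used so that the two sides can be compared for equality.

```python
from fractions import Fraction

def poisson_scheme_variance(ps):
    """Return (sum of p_i q_i, spq - s*sigma_p^2, spq) for trial probabilities ps."""
    s = len(ps)
    direct = sum(p * (1 - p) for p in ps)             # (2.60)
    p_bar = sum(ps) / s
    var_p = sum((p - p_bar) ** 2 for p in ps) / s     # sigma_p^2
    bernoulli = s * p_bar * (1 - p_bar)               # spq
    return direct, bernoulli - s * var_p, bernoulli

# Hypothetical probabilities for s = 4 trials.
ps = [Fraction(1, 10), Fraction(3, 10), Fraction(1, 2), Fraction(7, 10)]
direct, formula, bernoulli = poisson_scheme_variance(ps)
print(direct, formula, bernoulli)
```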

In the Lexis scheme, we consider n sets of s trials each, the probability of
success being constant within each set but varying from one set to another.
The number of successes in the jth set is $x_j$ and its expected value is $sp_j$. The
expectation of the total number of successes is

$$E(x) = s\sum_{j=1}^{n}p_j = snp \tag{2.62}$$

where $p = (1/n)\sum_j p_j$, so that the expected number of successes per set of s
trials is sp.

In the jth set,

$$E(x_j - sp_j)^2 = sp_jq_j$$

so that

$$E(x_j - sp)^2 = E(x_j - sp_j + sp_j - sp)^2 = sp_jq_j + (sp_j - sp)^2$$

The variance of x is therefore given by

$$\text{Var}(x) = s\sum_j p_jq_j + s^2\sum_j (p_j - p)^2 = s\sum_j p_j - s\sum_j p_j^2 + s^2\sum_j (p_j - p)^2$$

Putting $\sum_j p_j = np$, $\sum_j (p_j - p)^2 = n\sigma_p^2$, we obtain

$$\text{Var}(x) = snp - (sn\sigma_p^2 + snp^2) + s^2n\sigma_p^2 = snpq + s(s - 1)n\sigma_p^2 \tag{2.63}$$

so that the variance of the number of successes per set of s trials is

$$\text{Var}(x/n) = spq + s(s - 1)\sigma_p^2 \tag{2.64}$$

and hence is greater than if the probability of success remained constant from
set to set. The second term depends on the variation of probability from set
to set, a variation which in the applications may possess physical significance.

If $\sigma^2$ is the variance of the number of successes in an experimental set of s
trials, and if $\sigma_B^2$ is the variance calculated on the assumption of a Bernoulli
distribution, the ratio

$$L = \sigma/\sigma_B \tag{2.65}$$

is called the Lexis ratio. The dispersion is said to be subnormal if L < 1 and
supernormal if L > 1.
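A numerical check of (2.64), together with the theoretical Lexis ratio it implies, may be sketched as follows. The set probabilities are hypothetical, and the ratio printed here uses the theoretical variance in place of an experimental one, which is an assumption of this illustration.

```python
import math
from fractions import Fraction

def lexis_per_set_variance(s, ps):
    """Return (direct mean of E(x_j - sp)^2, formula (2.64), Bernoulli value spq)."""
    n = len(ps)
    p_bar = sum(ps) / n
    var_p = sum((p - p_bar) ** 2 for p in ps) / n      # sigma_p^2
    direct = sum(s * p * (1 - p) + (s * p - s * p_bar) ** 2 for p in ps) / n
    formula = s * p_bar * (1 - p_bar) + s * (s - 1) * var_p
    bernoulli = s * p_bar * (1 - p_bar)
    return direct, formula, bernoulli

# Hypothetical case: n = 3 sets of s = 20 trials each.
ps = [Fraction(1, 5), Fraction(3, 10), Fraction(2, 5)]
direct, formula, bernoulli = lexis_per_set_variance(20, ps)
L = math.sqrt(direct / bernoulli)
print(f"variance per set = {float(direct):.4f}, Bernoulli = {float(bernoulli):.4f}, L = {L:.4f}")
```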

2.16 The Hypergeometric Distribution. If we have a finite population of 
size n, composed, say, of black and white balls, and if we draw a sample of s 
balls without replacements, the probability of drawing a white ball at any 
trial no longer remains constant but depends on the results of earlier trials. 

In fact, if p is the original proportion of white balls in the population, the
chance of obtaining x white and $s - x$ black balls in s trials is given by

$$h_n(x, s) = \frac{(np)!}{x!\,(np - x)!}\cdot\frac{(nq)!}{(s - x)!\,(nq - s + x)!}\cdot\frac{s!\,(n - s)!}{n!} \tag{2.66}$$

By using Stirling's approximation for the factorials containing n (§ 3.3),
it is easily seen that

$$\lim_{n\to\infty}h_n(x, s) = \frac{s!}{x!\,(s - x)!}\,p^xq^{s-x} = p(x, s) \tag{2.67}$$

Hence sampling with replacements from a finite population gives the same 
distribution of successes in s trials as sampling without replacements from an 
infinite population. This, of course, is to be expected since removing a finite 
sample from an infinite population will not affect the chance of success in a 
new trial. 
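The approach of the hypergeometric probabilities to the binomial as the population grows can be observed numerically. In this sketch the values $s = 10$, $p = 0.3$, $x = 3$ are hypothetical, and the function names are ours.

```python
from math import comb

def hypergeometric(x, s, n, white):
    """Chance of x white balls in s draws without replacement from n balls,
    of which `white` are white."""
    return comb(white, x) * comb(n - white, s - x) / comb(n, s)

def binomial(x, s, p):
    """Chance of x successes in s independent trials."""
    return comb(s, x) * p ** x * (1 - p) ** (s - x)

s, p, x = 10, 0.3, 3
target = binomial(x, s, p)
for n in (20, 100, 1000, 100000):
    white = round(n * p)                  # np white balls in the population
    h = hypergeometric(x, s, n, white)
    print(f"n = {n:6d}   h_n = {h:.5f}   binomial = {target:.5f}")
```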

The function $h_n(x, s)$ may be written

$$h_n(x, s) = A\,\frac{s!}{x!\,(s - x)!}\cdot\frac{np(np - 1)\cdots(np - x + 1)}{(nq - s + 1)(nq - s + 2)\cdots(nq - s + x)} \tag{2.68}$$

where $A = (nq)!\,(n - s)!/(nq - s)!\,n!$. Hence the probability of x successes
in s trials is given by the coefficient of $u^x$ in the expression

$$A\left[1 + s\,\frac{np}{nq - s + 1}\,u + \frac{s(s - 1)}{1\cdot 2}\,\frac{np(np - 1)}{(nq - s + 1)(nq - s + 2)}\,u^2 + \cdots
+ \frac{np(np - 1)\cdots(np - s + 1)}{(nq - s + 1)(nq - s + 2)\cdots nq}\,u^s\right] \tag{2.69}$$

The part of this expression in square brackets may be written as a hypergeometric
function.

The hypergeometric function is defined by the infinite series

$$F(\alpha, \beta, \gamma; u) = 1 + \frac{\alpha\beta}{1\cdot\gamma}\,u + \frac{\alpha(\alpha + 1)\beta(\beta + 1)}{1\cdot 2\cdot\gamma(\gamma + 1)}\,u^2 + \cdots \tag{2.70}$$

If $\alpha = -s$, $\beta = -np$, $\gamma = nq - s + 1$, this series terminates, since s is an
integer, and agrees with the bracketed expression in (2.69). Therefore

$$\sum_{x=0}^{s}h_n(x, s)\,u^x = AF(-s, -np, nq - s + 1; u) \tag{2.71}$$

and for this reason the discrete distribution given by the values of $h_n(x, s)$
for $x = 0, 1, 2, \cdots, s$ is called a hypergeometric distribution. Putting u = 1 in
(2.71), we have

$$1 = \sum_{x=0}^{s}h_n(x, s) = AF(-s, -np, nq - s + 1; 1) \tag{2.72}$$

Differentiating (2.71) with respect to u, we have

$$\sum_{x=0}^{s}h_n(x, s)\,x\,u^{x-1} = AF'(-s, -np, nq - s + 1; u)$$

Putting u = 1, this gives

$$E(x) = \sum_{x=0}^{s} x\,h_n(x, s) = A\cdot F'(-s, -np, nq - s + 1; 1) \tag{2.73}$$

where $F'$ stands for $dF/du$. In the same way

$$E\{x(x - 1)\} = AF''(-s, -np, nq - s + 1; 1) \tag{2.74}$$

and so on.

Now $F(\alpha, \beta, \gamma; u)$ satisfies the differential equation

$$u(1 - u)F'' + [\gamma - (\alpha + \beta + 1)u]F' - \alpha\beta F = 0 \tag{2.75}$$

so that when u = 1,

$$(\gamma - \alpha - \beta - 1)F' = \alpha\beta F$$

or, substituting for $\alpha$, $\beta$, $\gamma$,

$$F' = spF \tag{2.76}$$

Hence from (2.72) and (2.73),

$$E(x) = sp \tag{2.77}$$

which shows that the expected value for x is exactly the same in this problem
as in the simple Bernoulli scheme.

Differentiating (2.75) and then putting u = 1, we get

$$-F'' + (\gamma - \alpha - \beta - 1)F'' - (\alpha + \beta + 1)F' - \alpha\beta F' = 0$$

or

$$(\gamma - \alpha - \beta - 2)F'' = (\alpha\beta + \alpha + \beta + 1)F'$$

Substituting for $\alpha$, $\beta$, $\gamma$, we find

$$(n - 1)F'' = (np - 1)(s - 1)F' = (np - 1)(s - 1)spF \tag{2.78}$$

Therefore

$$E\{x(x - 1)\} = \frac{(np - 1)(s - 1)}{n - 1}\,sp$$

so that

$$E(x^2) = \{(np - 1)(s - 1)/(n - 1) + 1\}\,sp \tag{2.79}$$

whence the variance of x is given by

$$E(x^2) - \{E(x)\}^2 = sp\{(np - 1)(s - 1)/(n - 1) + 1 - sp\} = \frac{n - s}{n - 1}\,\sigma_B^2 \tag{2.80}$$


Consequently, for s > 1, the variance is less than it would be if the popula- 
tion were infinite, or if the drawings were made with replacements. 
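The mean and variance just obtained can be confirmed by direct summation over the hypergeometric probabilities. The population below (20 balls, 8 white, samples of 5) is a hypothetical case; exact fractions are used so that equality can be tested without rounding.

```python
from fractions import Fraction
from math import comb

def hypergeometric_moments(n, white, s):
    """Exact mean and variance of the number of white balls in s draws
    without replacement from n balls of which `white` are white."""
    total = comb(n, s)
    probs = [Fraction(comb(white, x) * comb(n - white, s - x), total)
             for x in range(s + 1)]
    mean = sum(x * h for x, h in enumerate(probs))
    var = sum(x * x * h for x, h in enumerate(probs)) - mean ** 2
    return mean, var

n, white, s = 20, 8, 5
p = Fraction(white, n)
q = 1 - p
mean, var = hypergeometric_moments(n, white, s)
print(f"mean = {mean}, variance = {var}, Bernoulli variance spq = {s * p * q}")
```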


Problems 

1. Use Theorem 2.1 to approximate the following sums:

(а) the terms of 4- in which 50 < a: < 70. 

(b) the terms of $(.946 + .054)^{521}$ in which $x \ge 33$.




2. Fit a normal curve to the point binomial (| + 

3. Fit a normal curve to (i + iy. 

4. Suppose you are studying IQ's and it is known that 20% in the universe with which
you are dealing have an IQ below M, so that 1/5 is the probability that an individual chosen at
random has an IQ below M. (M itself has no bearing on the solution of the problem.) If
a teacher had a class of fifty which could be regarded as a random sample from this universe,
would it be exceptional if she found fewer than five or more than fifteen with IQ's below M?
(See Example 1.)

5. (Camp) A dean's report showed the following figures:

                    Honor Grades         Failures          Number
Subject            Number     %        Number     %       Examined

German                        36                  6.3        521
Mathematics                   35                  8.2        466
Music                11       50          0       0.0         22
All Subjects                  38                  5.4

Taking p = .38 for honor grades and p = .054 for failures find the probability: (a) that
in selecting 521 students at random (from a supposedly infinite number), one would obtain
as few honor grades as were obtained in German; (b) as many failures; (c) in selecting 466 at
random, one would obtain as few honor grades as were obtained in mathematics; (d) as many
failures; (e) in selecting 22 at random, one would obtain no failures (as in music); (f) eleven
or more honor grades.

Hints. (a) Find sum of terms of $(.62 + .38)^{521}$ in which $x \le 187$.

(b) See Problem 1 (b) above.

(e) Evaluate $(0.946)^{22}$ by logarithms.

6. (Burgess) If analyzed past experience shows that 4% of all insured white males of 
exact age 65 have died within a year, and it is found that 60 of a similar group of 1000 
actually die within a year, should the group be regarded as essentially different from the gen- 
eral mass — that is, is the departure from the expected mortality greater than might be 
expected as a result of chance variation alone? 

7. (Richardson) In a coin tossing experiment in which a coin was tossed 400 times, 250
heads appear. Do you believe the experiment was honestly performed?

8. (Lovitt and Holtzclaw) Would you be willing to bet 10 to 1 that an opponent could not
throw the sum 7 with two dice at least 23 times in a hundred throws with two dice?

9. A factory produces units of a standardized article at the rate of 1000 per day, with a 
probability of 0.05 that any one unit selected at random will be defective. Find the prob- 
ability that the number of defective units produced in one day will be not less than 40 nor 
more than 60. Ans. 0.863

10. Professor J. E. Kerrich in An Experimental Introduction to the Theory of Probability,
Einar Munksgaard, Copenhagen, 1946, has discussed very fully some experiments he made
during a period of enforced leisure in a Danish internment camp. In one of these, a coin was
spun 10,000 times, and the number of heads was 5067. Estimate the probability of a
discrepancy at least as great as this, if the true probability of "head" with the coin used was
precisely 0.5.

11. A coin is tossed s times. It is desired that the relative frequency of the appearance of
heads shall not be greater than .51 or less than .49. Find the smallest value of s that will
insure the above results with a degree of certainty $Q_\delta = .90$.

Solution. We must determine s such that $Q_\delta = .90$. We have

$$\delta = .02\sqrt s$$

since $p = q = \frac12$. Also

$$Q_\delta = 2\int_0^{\delta}\phi(t)\,dt = .90$$

whence from the tables we find $\delta = 1.645$. Therefore,

$$.02\sqrt s = 1.645$$

and

$$s = 6765, \text{ approximately.}$$

The exact solution is not readily obtainable, but we can take a trial value of s and see
with what probability the required condition is satisfied.

Thus if we let s = 6800, we find $s\epsilon = 68$, $\theta_1 = \theta_2 = 0$, $\sigma = 41.23$, $\delta_1 = 1.649$, and

$$Q_{\delta_1} = 2\Phi(1.649) + \frac{1}{41.23}\,\phi(1.649) - 1 + \Omega_1$$

where

$$|\Omega_1| < \frac{0.20}{1700} = .00012$$

Since $Q_{\delta_1} = 0.90334 + \Omega_1$, it is evident that the probability of the desired result is at least
0.90 when s = 6800.

12. A coin is tossed s times. It is desired that the relative frequency of the appearance of 
heads shall not be greater than .502 or less than .498. Find the smallest value of s that will 
insure the foregoing results with a degree of certainty Qs - i|. 

13. (Camp) A census report showed that in general 59.58% of New York City children 
went to school, but that only 56.8% of the Negro children went to school. The number of 
Negro children was 20,000. Was the difference due to chance? 

14. A tosses a coin, agreeing to pay B a dollar if it falls "head," and B agrees to pay A a
dollar if it falls "tail." They continue this game for 1000 tosses, keeping score, but not settling
up until the end of the series. If A has m dollars and B n dollars, calculate the probability
that the loser can pay what he owes.

Show that this probability is approximately equal to $\Phi(m/\sqrt{1000}) + \Phi(n/\sqrt{1000}) - 1$,
and calculate it for m = 20, n = 30. Ans. 0.565.

15. Within what limits will the number of heads lie, with 95% probability, in 1000 tosses
of a coin which is practically unbiased?

Assume that "practically unbiased" means that the true probability of "head" with this
coin does not differ from 0.5 by more than 0.001. Ans. 470 to 530.

Hint. Take p = 0.499, calculate $\delta$ from the relation $Q_\delta = 0.95$, and hence find
$\epsilon = \delta\sqrt{pq/1000}$. Limits of x are then given by $|x/1000 - 0.499| < \epsilon$. Slightly different
limits are given if p = 0.501. Take limits including both sets and verify that $Q_{\delta_1}$ is actually
> 0.95.

16. A company uses many thousands of electric lamps annually, burning continuously
day and night. Assume that under such conditions the life of a lamp may be regarded as a
variable normally distributed about a mean of 50 days with a standard deviation of 19 days.
On January 1, 1951, the company put 5000 new lamps into service. How many would
you expect to need replacement by (a) February 1, (b) April 1? The lamps may be supposed
all put into operation at about the same time of day. Ans. (a) 794, (b) 4912.

17. A pair of dice are thrown 3000 times and the two numbers that turn up are different 
from each other in 2726 of the throws. Does this result suggest a doubt as to the accuracy 
of the dice? 

18. (Bertrand) The proprietor of a gambling establishment complains to the makers of a
roulette wheel which he has installed that the wheel seems to favor red. In 1000 trials of 
which he has kept a record, red has shown up 515 times, black 455 times and white 30 times, 
the theoretical proportions being 18:18:1. The makers reply that the wheel was carefully 
constructed, and that nothing can be done about the laws of chance. What would be your 
opinion? 

19. Prove that the mode of the Poisson distribution is the integer lying between m — 1 
and m, unless m happens to be an integer, in which case the values for m — 1 and m are 
equal. 

20. "If we spin a halfpenny nothing within our knowledge may be able to decide whether
it will come down head or tail, yet if we throw up a million tons of halfpennies we know
that there will be 500,000 tons of heads and 500,000 tons of tails." (Sir James Jeans.)

Comment on this statement. Assuming that 160,000 halfpennies go to a ton, what can
we really say about the result of this experiment?

21. Show that the tangents to the curve $y = \phi(t)$ at its inflection points intersect the t-axis
at $t = \pm 2$.

22. If X has the frequency function 

f(x) - {2Tray^t2Q-(.x~m)V2a^ _0O < X < 00 

show by integration that the mean is m and the variance a, 

23. A continuous variate x has the frequency function 

fix) == Cx^f^il - a:)3/2, 0 <x <l 

(a) Show that this function vanishes with infinite slope at x =0, vanishes with zero slope 
at a; = 1 and has a single maximum at x ~ Sketch the curve. 

(b) Determine the constant C so that the area under the curve is unity. 

(c) Show that n = 3/8, /xi = 3/64, ti = 2^3/9, 72 = -2/3. Am. C = 16/x. 

24. In a fairy story, the fairy godmother assures the queen that her baby prince will not 
die until the following condition has been fulfilled. A scroll is prepared, and on the day of 
birth and every subsequent birthday a letter of the Greek alphabet, chosen at random, is 
entered on the scroll. ‘ On the day that all the letters of the alphabet appear, he will die. 
What is his expectation of life? There are 24 letters in the Greek alphabet. Am. 89.6 
years. 

Hint. Suppose that at any stage there are $n$ letters remaining to be picked. The chance
that the next letter picked will belong to this group is $n/24$. The chance that it does not,
but the following letter does, is $(1 - n/24)\,n/24$, and so on. Hence show that the expected
number of trials before one of the group is picked is $24/n$.

Therefore the total expectation is $24(1 + \tfrac{1}{2} + \cdots + \tfrac{1}{24})$.
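The sum in the hint can be evaluated exactly. The sketch below is only an illustration (the function name `expected_trials` is ours, not the book's); it assumes, as the problem states, that the first letter is entered on the day of birth, so the age at the last entry is one less than the number of entries.

```python
from fractions import Fraction

def expected_trials(letters: int) -> Fraction:
    """Expected number of random draws needed to see every one of `letters` symbols."""
    # When k letters are still missing, the wait for a new one averages letters/k draws.
    return sum(Fraction(letters, k) for k in range(1, letters + 1))

trials = expected_trials(24)          # 24 * (1 + 1/2 + ... + 1/24)
life_in_years = float(trials) - 1     # the first letter is entered at age 0
print(round(float(trials), 2), round(life_in_years, 2))
```

The expected number of entries is about 90.6, which corresponds to the stated answer of 89.6 years.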

References 

1. J. Shohat, “Stieltjes Integrals in Mathematical Statistics,” Ann. Math. Stat., 1,
1930, p. 73.

2. I. Kaplansky (J. Amer. Stat. Assoc., 40, 1945, p. 259) has shown by means of examples
that kurtosis has not necessarily anything whatever to do with “peakedness.” A distribution
with a perfectly flat top may have infinite kurtosis, and one with infinite peakedness
(in the sense that $f(x) \to \infty$ as $|x| \to 0$) may have a negative excess. Nevertheless, for
many distributions encountered in practice, a positive $\gamma_2$ does mean a sharper peak with
higher tails than if the distribution were normal.

3. A. T. Craig, Bull. Amer. Math. Soc., 40, 1934, p. 262.

4. W. Feller, “Note on the Law of Large Numbers and ‘Fair’ Games,” Ann. Math.
Stat., 16, 1945, p. 301.

5. J. V. Uspensky, Introduction to Mathematical Probability, 1937, Chapter VII.

6. See an account, with references, in J. M. Keynes, A Treatise on Probability, p. 362.

7. E. C. Molina, Poisson's Exponential Binomial Limit, 1942.



CHAPTER III 


SOME USEFUL INTEGRALS AND FUNCTIONS 


To avoid interruption later on we discuss here certain integrals and functions 
which will be useful in subsequent chapters. 

3.1 Some Properties of Definite Integrals. You will recall from elementary
calculus that a definite integral is a number defined by a limiting process.
This number is designated by

$\int_a^b f(x)\,dx = F(b) - F(a)$

where $f(x)$ is a function of the real variable $x$ continuous in the closed interval
$(a, b)$, and $F(x)$ is such that $F'(x) = f(x)$ at all points of $(a, b)$.

A function $f(x)$ is said to be even if $f(-x) \equiv f(x)$ and odd if $f(-x) \equiv -f(x)$.
The following properties of definite integrals are frequently useful:

(a) $\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx$

regardless of the relative positions of $a$, $b$, $c$ on the real number scale.

(b) $\int_{-a}^{a} f(x)\,dx = 2\int_0^a f(x)\,dx$, if $f(x)$ is even, and $= 0$ if $f(x)$ is odd.

(c) $\int_0^a f(x)\,dx = \int_0^a f(a - x)\,dx$

Example.

$\int_0^{\pi/2} \sin^n\theta\,d\theta = \int_0^{\pi/2} \sin^n\!\left(\frac{\pi}{2} - \theta\right) d\theta = \int_0^{\pi/2} \cos^n\theta\,d\theta$


(d) If $f(x)$ can be expanded in a power series which converges for all values
of $x$ in the interval of integration, the series may be integrated term by term.
It is thus often possible to exhibit a definite integral as a convergent series.

(e) If either the upper or lower bound of integration is infinite the integral
is said to be improper. The value is defined by a limiting process:

$\int_a^{\infty} f(x)\,dx = \lim_{b \to \infty} \int_a^b f(x)\,dx$

and

$\int_{-\infty}^{b} f(x)\,dx = \lim_{a \to -\infty} \int_a^b f(x)\,dx$

If the limit exists, the improper integral is said to converge; if the limit does
not exist, the integral is said to diverge and no value is attached to it.

(f) $\int_{-\infty}^{\infty} f(x)\,dx$ is defined as $\int_{-\infty}^{c} f(x)\,dx + \int_c^{\infty} f(x)\,dx$ for any finite value
of $c$.

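(g) Property (b) holds when $a$ is infinite.

Differentiation under the integral sign. Given

$H(\theta) = \int_a^b f(x, \theta)\,dx$

where $\theta$ is a parameter and $a$ and $b$ are differentiable functions of $\theta$, and where
$\partial f(x, \theta)/\partial\theta$ exists and is less in absolute value than some integrable function
of $x$ for all admissible $\theta$ and for all $x$ in the interval $(a, b)$, then

$\frac{dH}{d\theta} = \int_a^b \frac{\partial f(x, \theta)}{\partial\theta}\,dx + f(b, \theta)\frac{db}{d\theta} - f(a, \theta)\frac{da}{d\theta}$

When $a$ and $b$ are independent of $\theta$, this reduces to

$\frac{dH}{d\theta} = \int_a^b \frac{\partial f(x, \theta)}{\partial\theta}\,dx$
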
3.2 The Gamma Function. The improper integral

(3.1) $\Gamma(n) = \int_0^{\infty} x^{n-1} e^{-x}\,dx$

which converges for $n > 0$, is called the Gamma function of the positive
number $n$. The difference equation

(3.2) $\Gamma(n + 1) = n\,\Gamma(n)$

is easily established from (3.1) by integration by parts (see the chapter on
the Gamma function in any textbook on advanced calculus). By successive
reduction of (3.2) we obtain

$\Gamma(n + 1) = n(n - 1) \cdots (n - k)\,\Gamma(n - k)$

where $k$ is a positive integer less than $n$. If $n$ is also a positive integer and
$k = n - 1$, then we have

(3.3) $\Gamma(n + 1) = n!$

since from (3.1), $\Gamma(1) = 1$. Because of (3.3) the Gamma function is sometimes
called the factorial function. It may be considered as a generalization
of $n!$ when $n$ is fractional. In fact, we can define $n!$ for any value of $n > -1$
by the equation

(3.4) $n! = \int_0^{\infty} x^n e^{-x}\,dx$

and the use of a separate symbol for the Gamma function is really unnecessary. 
However, the symbol is so widely used in books and tables that the student
should be familiar with it. The graph of the function is shown in Figure 9.
It can be drawn from the following values, some of
which follow immediately from (3.2); the others will be
established later.

$\Gamma(0) = \infty$            $\Gamma(2) = 1$

$\Gamma(1) = 1$            $\Gamma(3) = 2$

$\Gamma(\tfrac{3}{2}) = \tfrac{1}{2}\sqrt{\pi}$      $\Gamma(4) = 6$

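Other forms of (3.1) may be obtained by changes
of variable. For example, putting $x = y^2$,

(3.5) $\Gamma(n) = 2\int_0^{\infty} y^{2n-1} e^{-y^2}\,dy$

Now we can show that

(3.6) $\int_0^{\infty} e^{-y^2}\,dy = \tfrac{1}{2}\sqrt{\pi}$

[Fig. 9. Graph of the Gamma function.]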

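To establish (3.6) we first observe from (3.5) that

(3.7) $\Gamma(\tfrac{1}{2}) = 2\int_0^{\infty} e^{-y^2}\,dy$

Since (3.7) is independent of the variable of integration, we may also write

$\Gamma(\tfrac{1}{2}) = 2\int_0^{\infty} e^{-x^2}\,dx$

So

(3.8) $[\Gamma(\tfrac{1}{2})]^2 = 4\int_0^{\infty} e^{-x^2}\,dx \int_0^{\infty} e^{-y^2}\,dy = 4\int_0^{\infty}\!\!\int_0^{\infty} e^{-(x^2 + y^2)}\,dx\,dy$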

the passage from the product of two integrals to the double integral being 
valid since neither the limits nor the integrand of either integral depend on 
the variable in the other. 

To evaluate (3.8) it will be convenient to change to polar coordinates.
First, however, we will make a few remarks about a change of variables in
general. Let $x$ and $y$ be the coordinates of a point with respect to a set of
rectangular axes in a plane, $u$ and $v$ the coordinates of another point with
respect to a similarly chosen set of rectangular axes in some other plane.
Suppose we have a function of the variables $x$ and $y$,

$f(x, y)$

and we make $x$ and $y$ depend on new variables $u$ and $v$ by the relations

$x = g(u, v)$ and $y = h(u, v)$

These relations establish a certain correspondence between the points of the
two planes. Let $dA$ be an element of area for the function $f(x, y)$. Then it
is shown in advanced calculus * that

$dA = \left| J\!\left(\frac{x, y}{u, v}\right) \right| du\,dv$

where $\left| J\!\left(\frac{x, y}{u, v}\right) \right|$ is a convenient symbol for the absolute value of the determinant

$\begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} = \frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial x}{\partial v}\frac{\partial y}{\partial u}$

and the latter is called the Jacobian or functional determinant of the transformation.

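If, then, we change (3.8) to polar coordinates by letting

(3.9) $x = r\cos\theta, \quad y = r\sin\theta$

the Jacobian is

$\begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r$

Therefore, the element of integration $dx\,dy$ becomes $r\,dr\,d\theta$. The limits of integration
are now from 0 to $\infty$ for $r$ and from 0 to $\pi/2$ for $\theta$. From (3.9),
$x^2 + y^2 = r^2$. So (3.8) becomes

$[\Gamma(\tfrac{1}{2})]^2 = 4\int_0^{\pi/2}\!\!\int_0^{\infty} e^{-r^2} r\,dr\,d\theta = 2\int_0^{\pi/2} d\theta = \pi$

Hence,

(3.10) $\Gamma(\tfrac{1}{2}) = (\pi)^{1/2}$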
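The relations (3.2), (3.3) and (3.10) are easy to check numerically. The following is a minimal sketch using the `math.gamma` function of the Python standard library; the particular test values (2.5 and the integers 1 to 10) are arbitrary choices made for this illustration.

```python
import math

# Recurrence (3.2): Gamma(n + 1) = n * Gamma(n), here at a fractional n.
n = 2.5
lhs = math.gamma(n + 1)
rhs = n * math.gamma(n)

# (3.3): Gamma(k + 1) = k! for positive integers k.
factorial_ok = all(
    math.isclose(math.gamma(k + 1), math.factorial(k)) for k in range(1, 11)
)

# (3.10): Gamma(1/2) = sqrt(pi).
half = math.gamma(0.5)
print(lhs, rhs, factorial_ok, half, math.sqrt(math.pi))
```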


For a more general form of (3.6) we may let $y = t/(2k)^{1/2}$, $k > 0$, and
obtain

(3.11) $\int_0^{\infty} e^{-t^2/2k}\,dt = \tfrac{1}{2}(2\pi k)^{1/2}$

and

(3.12) $\int_{-\infty}^{\infty} e^{-t^2/2k}\,dt = (2\pi k)^{1/2}$

* See Mathematical Analysis, Goursat-Hedrick, vol. 1.
An alternate derivation of (3.10) may be given as follows. The right-hand
member of (3.8) represents the volume $V$ under the bell-shaped surface

(3.13) $z = e^{-(x^2 + y^2)}$

and so from (3.8) we have $\Gamma(\tfrac{1}{2}) = V^{1/2}$.

Since (3.13) is a surface of revolution
we may take as the element of volume
a cylindrical shell of radius $r$,
thickness $dr$, and height $z$. Then

$dV = 2\pi r\,dr\,z = 2\pi r e^{-r^2}\,dr$

$V = 2\pi \int_0^{\infty} e^{-r^2} r\,dr = \pi$

[Fig. 10. The surface (3.13).]

and consequently we obtain (3.10).

3.3 Stirling's Approximation.

Tables of $\log n!$, to seven places, are available in Glover's Tables up to
$n = 1000$. However, it is often convenient to replace $n!$ in a formula by an
expression more amenable to algebraic treatment. The most widely used expression
is Stirling's:

(3.14) $n! \approx (2\pi n)^{1/2} n^n e^{-n}$

Actually this is only the first term in an asymptotic series

(3.15) $n! = (2\pi n)^{1/2} n^n e^{-n} \left[ 1 + \frac{1}{12n} + \frac{1}{288n^2} + \cdots \right]$

or equivalently, taking the logarithm and using the series for $\log(1 + x) =
x - x^2/2 + x^3/3 - \cdots$,

(3.16) $\log n! = \tfrac{1}{2}\log 2\pi + (n + \tfrac{1}{2})\log n - n + \frac{1}{12n} - \frac{1}{360n^3} + \cdots$

The expression on the right starting at $1/12n$ is not a convergent series, but
the early terms decrease rapidly for values of $n$ larger than, say, 5, and the error
made by stopping at any term is less than the magnitude of the next following
term.

We will establish the formula in the following form, which is quite sufficient
in practice.

(3.17) $\log n! = \tfrac{1}{2}\log 2\pi + (n + \tfrac{1}{2})\log n - n + \omega_n$

where $0 < \omega_n < 1/12n$.
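A quick numerical look at the remainder term may help fix ideas. The sketch below is an illustration only: it takes $\log n!$ from the standard-library function `math.lgamma` (so $\log n!$ is `lgamma(n + 1)`), and the helper names `stirling_log` and `remainder` are ours.

```python
import math

def stirling_log(n: int) -> float:
    """Right-hand side of (3.17) without the remainder term."""
    return 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n

def remainder(n: int) -> float:
    """omega_n: log n! minus the Stirling expression."""
    return math.lgamma(n + 1) - stirling_log(n)

for n in (1, 5, 10, 100):
    w = remainder(n)
    # Last column checks the bound 0 < omega_n < 1/(12n).
    print(n, w, 1 / (12 * n), 0 < w < 1 / (12 * n))
```

Since the ratio of Stirling's value (3.14) to the exact $n!$ is $e^{-\omega_n}$, the same function gives the relative error of the approximation; for $n = 10$ the ratio is about 0.9917.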

$\log n! = \log 1 + \log 2 + \log 3 + \cdots + \log n$
$= \sum_{k=2}^{n} \log k$
$=$ sum of shaded rectangles in Figure 11

Each rectangle is made up of the area under the curve $y = \log x$, plus a triangular
piece RST, minus the area RVTU between the included portion of
the curve and its chord.

Thus for the rectangle between $k - 1$ and $k$,

the area $= \int_{k-1}^{k} \log x\,dx + \tfrac{1}{2}[\log k - \log(k - 1)] - \epsilon_k$

where

$\epsilon_k = \int_{k-1}^{k} \log x\,dx - \tfrac{1}{2}[\log k + \log(k - 1)] = (k - \tfrac{1}{2})\log\frac{k}{k-1} - 1$

since $\int \log x\,dx = x\log x - x$. Hence, adding up for all the rectangles,

$\log n! = \int_1^n \log x\,dx + \tfrac{1}{2}[\log n - \log(n - 1) + \log(n - 1) - \log(n - 2) + \cdots + \log 2 - \log 1] - \sum_{k=2}^{n} \epsilon_k$

$= n\log n - n + 1 + \tfrac{1}{2}\log n - \sum_{k=2}^{n} \epsilon_k$

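We can now prove that $\sum_{k=2}^{\infty} \epsilon_k$ converges to a finite quantity $a$. For

$\epsilon_k = \frac{2k - 1}{2}\log\frac{k}{k - 1} - 1 = \frac{2k - 1}{2}\log\frac{1 + 1/(2k - 1)}{1 - 1/(2k - 1)} - 1$

$= (2k - 1)\left[\frac{1}{2k - 1} + \frac{1}{3(2k - 1)^3} + \frac{1}{5(2k - 1)^5} + \cdots\right] - 1$

on expanding the logarithms of numerator and denominator. Therefore

$\epsilon_k = \frac{1}{3(2k - 1)^2} + \frac{1}{5(2k - 1)^4} + \cdots$

so that

$0 < \epsilon_k < \frac{1}{3}\left[\frac{1}{(2k - 1)^2} + \frac{1}{(2k - 1)^4} + \cdots\right] = \frac{1}{12k(k - 1)}$

But

$\sum_{k=2}^{\infty} \frac{1}{k(k - 1)} = \sum_{k=2}^{\infty} \left[\frac{1}{k - 1} - \frac{1}{k}\right] = 1$

so that $\sum_{k=2}^{\infty} \epsilon_k < \tfrac{1}{12}$ and so is finite ($= a$). Therefore

$\sum_{k=2}^{n} \epsilon_k = a - \sum_{k=n+1}^{\infty} \epsilon_k = a - \omega_n$

where

$0 < \omega_n < \sum_{k=n+1}^{\infty} \frac{1}{12k(k - 1)} = \frac{1}{12n}$

Hence

$\log n! = (n + \tfrac{1}{2})\log n - n + 1 - a + \omega_n$
$= \log c + (n + \tfrac{1}{2})\log n - n + \omega_n$

where $\log c = 1 - a$, so that

(3.18) $n! = c\,n^{n + 1/2} e^{-n} e^{\omega_n}$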

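It remains to evaluate the constant $c$.

By making use of Wallis's formula (see Example 4 of § 3.5),

(3.19) $\frac{\pi}{2} = \lim_{n \to \infty} \frac{1}{2n + 1}\left[\frac{2 \cdot 4 \cdot 6 \cdots 2n}{1 \cdot 3 \cdot 5 \cdots (2n - 1)}\right]^2 = \lim_{n \to \infty} \frac{2^{4n}(n!)^4}{[(2n)!]^2 (2n + 1)}$

and substituting for the factorials from (3.18) we obtain

$\frac{\pi}{2} = \lim_{n \to \infty} \frac{c^4 n^{4n+2} e^{-4n}\, 2^{4n}}{c^2 (2n)^{4n+1} e^{-4n} (2n + 1)} = \lim_{n \to \infty} \frac{c^2 n}{2(2n + 1)} = \frac{c^2}{4}$

(the factors $e^{\omega_n}$ and $e^{\omega_{2n}}$ both tend to 1). Therefore $c = \sqrt{2\pi}$.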
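The slow convergence of Wallis's product toward $\pi/2$ can be seen numerically. The sketch below is illustrative; it evaluates the factorial form of the product through logarithms (`math.lgamma`) purely to avoid overflow for large $n$, and the function name `wallis` is ours.

```python
import math

def wallis(n: int) -> float:
    """Partial Wallis product: [2*4*...*2n / (1*3*...*(2n-1))]^2 / (2n + 1)."""
    # Equal to 2^(4n) (n!)^4 / ([(2n)!]^2 (2n+1)); logs keep large n from overflowing.
    log_value = (
        4 * n * math.log(2) + 4 * math.lgamma(n + 1) - 2 * math.lgamma(2 * n + 1)
    )
    return math.exp(log_value) / (2 * n + 1)

for n in (10, 1000, 100000):
    print(n, wallis(n), math.pi / 2)
```

The partial products increase toward $\pi/2$ from below, with an error roughly proportional to $1/n$.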

3.4 The Beta Function. The definite integral

(3.20) $B(m, n) = \int_0^1 x^{m-1}(1 - x)^{n-1}\,dx$

is called the Beta function of any two positive numbers $m$ and $n$. Another
useful form is

(3.21) $B(m, n) = 2\int_0^{\pi/2} \sin^{2m-1}\theta \cos^{2n-1}\theta\,d\theta$

which is obtained by letting $x = \sin^2\theta$ in (3.20).
If we let $x = 1 - y$, (3.20) becomes

$B(m, n) = \int_0^1 (1 - y)^{m-1} y^{n-1}\,dy = \int_0^1 x^{n-1}(1 - x)^{m-1}\,dx = B(n, m)$

Therefore, $m$ and $n$ may be interchanged.

If we let $x = (1 + y)^{-1}$, (3.20) becomes

(3.22) $B(m, n) = \int_0^{\infty} \frac{y^{n-1}\,dy}{(1 + y)^{m+n}}$

and here also, $m$ and $n$ may be interchanged.

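A relation between the Beta and Gamma functions may be obtained as
follows. From (3.5) we may write

$\Gamma(n)\Gamma(m) = 4\int_0^{\infty} x^{2n-1} e^{-x^2}\,dx \int_0^{\infty} y^{2m-1} e^{-y^2}\,dy$

$= 4\int_0^{\infty}\!\!\int_0^{\infty} x^{2n-1} y^{2m-1} e^{-(x^2 + y^2)}\,dx\,dy$

Since the region of integration is the first quadrant of the $xy$-plane, we have,
upon changing to polar coordinates,

$\Gamma(n)\Gamma(m) = 4\int_0^{\pi/2}\!\!\int_0^{\infty} r^{2(m+n-1)} e^{-r^2} \sin^{2m-1}\theta \cos^{2n-1}\theta\; r\,d\theta\,dr$

$= 4\int_0^{\pi/2} \sin^{2m-1}\theta \cos^{2n-1}\theta\,d\theta \int_0^{\infty} r^{2(m+n)-1} e^{-r^2}\,dr$

$= B(m, n)\,\Gamma(m + n)$

by (3.21) and (3.5). Hence

(3.23) $B(m, n) = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m + n)}$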
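The relation between the Beta and Gamma functions can be checked by comparing a direct numerical integration of the trigonometric form (3.21) with the Gamma-function ratio. This is a rough sketch under our own choices: the midpoint rule, the step count, and the function names are all assumptions made for illustration.

```python
import math

def beta_by_gamma(m: float, n: float) -> float:
    """Gamma(m) Gamma(n) / Gamma(m + n)."""
    return math.gamma(m) * math.gamma(n) / math.gamma(m + n)

def beta_by_trig_integral(m: float, n: float, steps: int = 20000) -> float:
    """Midpoint-rule estimate of 2 * integral over (0, pi/2) of sin^(2m-1) cos^(2n-1)."""
    width = (math.pi / 2) / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        total += math.sin(theta) ** (2 * m - 1) * math.cos(theta) ** (2 * n - 1)
    return 2 * total * width

print(beta_by_gamma(2.5, 1.5), beta_by_trig_integral(2.5, 1.5))
```

For $m = 5/2$, $n = 3/2$ both values agree with $\pi/16$.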

3.5 Reduction to Gamma and Beta Functions. By appropriate changes of 
variables many of the integrals that occur in statistics may be evaluated by 
expressing them in terms of Gamma and Beta functions. 


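Example 1. Prove that

$\int_0^{\infty} e^{-Ns^2/2\sigma^2} s^{N-1}\,ds = \frac{1}{2}\left(\frac{2\sigma^2}{N}\right)^{N/2} \Gamma\!\left(\frac{N}{2}\right)$

Solution. This integral may be written

$\frac{1}{2}\int_0^{\infty} e^{-Ns^2/2\sigma^2} (s^2)^{(N-2)/2}\,d(s^2)$

By the substitution

$x = \frac{Ns^2}{2\sigma^2}, \quad d(s^2) = \frac{2\sigma^2}{N}\,dx$

this becomes

$\frac{1}{2}\left(\frac{2\sigma^2}{N}\right)^{N/2} \int_0^{\infty} x^{(N-2)/2} e^{-x}\,dx = \frac{1}{2}\left(\frac{2\sigma^2}{N}\right)^{N/2} \Gamma\!\left(\frac{N}{2}\right)$

Example 2. Determine $k$ so that

$k\int_0^{\infty} e^{-Ns^2/2\sigma^2} (s^2)^{(N-3)/2}\,d(s^2) = 1$

Solution. By the substitution

$x = \frac{Ns^2}{2\sigma^2}, \quad d(s^2) = \frac{2\sigma^2}{N}\,dx$

this becomes

$k\left(\frac{2\sigma^2}{N}\right)^{(N-1)/2} \int_0^{\infty} x^{(N-3)/2} e^{-x}\,dx = 1$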


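Example 3. Determine $K$ so that $K\int_{-\infty}^{\infty} (1 + z^2)^{-N/2}\,dz = 1$.

Solution. By the substitution $z = \tan\theta$ this becomes $2K\int_0^{\pi/2} \cos^m\theta\,d\theta = 1$ where
$m = N - 2$. From Problem 9 below we find that

$K\,B\!\left(\frac{m + 1}{2}, \frac{1}{2}\right) = K\,\frac{\Gamma\!\left(\frac{m+1}{2}\right)\Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{m+2}{2}\right)} = 1$

whence

$K = \frac{\Gamma\!\left(\frac{N}{2}\right)}{\Gamma\!\left(\frac{N-1}{2}\right)\Gamma\!\left(\frac{1}{2}\right)}$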


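Example 4. Prove Wallis's formula, equation (3.19).

Solution. From Problem 9 below, $\int_0^{\pi/2} \sin^n\theta\,d\theta = \tfrac{1}{2}B[\tfrac{1}{2}(n + 1), \tfrac{1}{2}]$. Since, in the
interval 0 to $\pi/2$, $0 < \sin\theta < 1$, we have

$\int_0^{\pi/2} \sin^{2m-1}\theta\,d\theta > \int_0^{\pi/2} \sin^{2m}\theta\,d\theta > \int_0^{\pi/2} \sin^{2m+1}\theta\,d\theta$

or $B(m, \tfrac{1}{2}) > B(m + \tfrac{1}{2}, \tfrac{1}{2}) > B(m + 1, \tfrac{1}{2})$. Hence, by (3.23),

$\frac{\Gamma(m)}{\Gamma(m + \tfrac{1}{2})} > \frac{\Gamma(m + \tfrac{1}{2})}{\Gamma(m + 1)} > \frac{\Gamma(m + 1)}{\Gamma(m + \tfrac{3}{2})} = \frac{m}{m + \tfrac{1}{2}} \cdot \frac{\Gamma(m)}{\Gamma(m + \tfrac{1}{2})}$

The ratio of the two extreme members of this inequality is $m/(m + \tfrac{1}{2})$ and so $\to 1$ as
$m \to \infty$. Therefore the ratio of the second to the third also $\to 1$ as $m \to \infty$. So

$\lim_{m \to \infty} (m + \tfrac{1}{2})\left[\frac{\Gamma(m + \tfrac{1}{2})}{\Gamma(m + 1)}\right]^2 = 1$

Now if $m$ is a positive integer,

$\frac{\Gamma(m + \tfrac{1}{2})}{\Gamma(m + 1)} = \frac{1}{m!} \cdot \frac{(2m - 1)(2m - 3) \cdots 1}{2^m}\sqrt{\pi}$

$= \frac{1}{m!} \cdot \frac{2m(2m - 1)(2m - 2) \cdots 1}{2^m \cdot 2^m\, m!}\sqrt{\pi} = \frac{(2m)!\sqrt{\pi}}{2^{2m}(m!)^2}$

Therefore

$\lim_{m \to \infty} \frac{(m + \tfrac{1}{2})\,\pi\,[(2m)!]^2}{2^{4m}(m!)^4} = 1$

which is equivalent to (3.19).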


3.6 Incomplete Beta and Gamma Functions. The integral

(3.24) $\Gamma_x(n) = \int_0^x x^{n-1} e^{-x}\,dx$

is called the incomplete Gamma function. Similarly

(3.25) $B_x(m, n) = \int_0^x x^{m-1}(1 - x)^{n-1}\,dx$

is called the incomplete Beta function. Both (3.24) and (3.25) are useful functions
in mathematical statistics and they have been tabulated by Karl
Pearson and his staff at the Biometric Laboratory, University College, London.

Obviously $\Gamma_{\infty}(n) = \Gamma(n)$, and $B_1(m, n) = B(m, n)$, so that the ordinary
Gamma and Beta functions are often called complete.

Note that Pearson's tables give the ratios

$I(u, n) = \frac{\Gamma_x(n + 1)}{\Gamma(n + 1)}$, where $x = u\sqrt{n + 1}$

and

$I_x(m, n) = \frac{B_x(m, n)}{B(m, n)}$

which are usually more useful than the incomplete functions themselves.


Problems 


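1. Show that

$\frac{\Gamma(n + 1)}{\Gamma(r + 1)\Gamma(n - r + 1)} = \binom{n}{r}$

2. Prove that

$\int_0^{\infty} \phi(t)\,dt = \tfrac{1}{2}$, where $\phi(t) = (2\pi)^{-1/2} e^{-t^2/2}$

3. If $f(n) = n^{1/2} B(n/2, \tfrac{1}{2})$, prove that $\lim_{n \to \infty} f(n) = (2\pi)^{1/2}$.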
X co

+ xl2y dx by transforming it into a Gamma function. 

Hint. Let cy — 1 x 12 and determine c so that = ke~^. 

Ans. (e67!)/3(60. 

6. Evaluate e’^{x — 6)^ dx. Ans. 

f ree 

given that J dx = (Tr)^/*, 

Hint r(|) - ir(i). 

7. Find the difference and the ratio between the exact value of $10!$ and the approximate
value obtained by using Stirling's formula.

8. Using (3.23), show that '' 

-(f) ^ 


- 1), i] 

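9. Prove that $\int_0^{\pi/2} \cos^m\theta\,d\theta = \int_0^{\pi/2} \sin^m\theta\,d\theta = \tfrac{1}{2}B[(m + 1)/2, \tfrac{1}{2}]$. Hint. Use (3.21).

10. Show that $K\int_{-\infty}^{\infty} (1 + z^2)^{-N/2} z^2\,dz = 1/(N - 3)$, where $1/K = B[(N - 1)/2, \tfrac{1}{2}]$.

11. Show that $K\int_{-\infty}^{\infty} (1 + t^2/n)^{-(n+1)/2}\, t^2\,dt = n/(n - 2)$, where $1/K = n^{1/2} B(n/2, \tfrac{1}{2})$.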

12. If $p(s, x) = \binom{s}{x} p^x q^{s-x}$, prove by means of Stirling's approximation that the maximum
value of $p(s, x)$ for any $x$ tends, as $s \to \infty$, to $1/\sqrt{2\pi spq}$. Hint. The maximum is given for
$sp - q \le x \le sp + p$. See Problem 12 following § 2.8.

13. Prove by using Stirling's approximation that the probability of exactly x successes 
in s trials, of an event with the constant probability ^ of success in a single trial, is 

to terms of order l/s. 

14. Prove that if /(w, x) is the Poisson exponential function and if 

PC 

Qn{m) = x), then 

Hint. Writing Taylor's theorem in the form 

/(a +h)= m + h/'(a) + • • ■ + /«(a + «h)(l - 0-^ dt 


put/(x) = e*, a = 0, = m. Hence show that 

n-l 

1 = 2 ' 


1 






0 


'^du 


15. Using equation (3.22), show that

$B(m, n) = \int_0^1 \frac{x^{m-1} + x^{n-1}}{(1 + x)^{m+n}}\,dx$

Hint. Divide the range of integration in (3.22) into two parts, 0 to 1 and 1 to $\infty$, and in
the second part put $y = 1/x$.

16. Obtain an asymptotic series for the computation of the integral $\int_x^{\infty} e^{-y^2/2}\,dy$, for large
values of $x$.


= _ jf ^-J/2 

Wnte the last integral as £ (l/y) (ye-ifi) dy and integrate by parts repeatedly. The result is 
£e J dy - !■ + ■S.+iJ 


2”a:2» ^ 


iS»+l = (-l)n+ll.'^ ••(2w + l) 


Tn+i decreases as n increases, as long as n < xK Show that 

I ifthr^Ip^nto “*'*^*“ “ “ 'Prohlem 16 to evaluate £<i,(,t) dt, and check with Table 
18 . Use Stirling’s approximation to verify the statement in § 2.16 (see equation 2.67) that 


lim kn(x, s) « 


;T(r±^w 
CHAPTER IV


DISTRIBUTIONS OF TWO OR MORE VARIABLES; MOMENT-GENERATING 
FUNCTIONS; THE LAW OF LARGE NUMBERS 


4.1 Joint Distributions of Two Variables. Definitions of a frequency function
of one variable and the associated notion of probability were given in
§ 2.3. Corresponding definitions will now be given for an arbitrary probability
distribution of two variables. The continuous variables $x$ and $y$ have the
joint frequency function $f(x, y)$ if the double integral of $f(x, y)$ over a region
of the $(x, y)$-plane measures the frequency of occurrence of pairs of values
$(x, y)$ in that region. It will be understood that $f(x, y)$ is continuous, single-valued,
and non-negative. If values of $(x, y)$ are restricted to a finite region
we define $f(x, y)$ to be identically zero outside that region. In the extended
region of definition, we set

(4.1) $\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1$

Geometrically, this means that the volume under the surface represented by
$z = f(x, y)$ is unity. Then $f(x, y)\,dy\,dx$ is the probability that simultaneously
$x$ lies in the interval $(x, x + dx)$ and $y$ lies in the interval $(y, y + dy)$. Consequently,

(4.2) $\int_a^b\!\int_c^d f(x, y)\,dy\,dx$

represents the probability that $x$ lies between $a$ and $b$ at the same time that
$y$ lies between $c$ and $d$.

We shall distinguish between two cases: (a) when the variables are independent
in the probability sense, and (b) when they are not. Let the probability
be $g(x)\,dx$ that $x$ occurs in $dx$ for all $y$'s. Then integrating over all
admissible values of $y$, we have

(4.3) $g(x)\,dx = dx\int_{-\infty}^{\infty} f(x, y)\,dy$

It is clear that the integral in (4.3) gives $g(x)$ because the relative frequency of
occurrence of $x$ in any interval $(a, b)$ is the relative frequency of pairs $(x, y)$
belonging to the strip of the $xy$-plane for which $a < x < b$, and this is

$\int_a^b\!\int_{-\infty}^{\infty} f(x, y)\,dy\,dx = \int_a^b g(x)\,dx$

Similarly, if $h(y)\,dy$ is the probability that $y$ occurs in $dy$ for all assignments
of $x$, we have

(4.4) $h(y)\,dy = dy\int_{-\infty}^{\infty} f(x, y)\,dx$

In accordance with convention we shall call $g(x)$ and $h(y)$ the marginal distributions.

The independence of $x$ and $y$ is characterized by the following

Definition. The variables $x$ and $y$ are independent when $f(x, y) = g(x)h(y)$.
If $f(x, y)$ cannot be expressed identically as the product of the marginal distributions,
then $x$ and $y$ are not independent.

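4.2 Moments. Let the general product moment about the common origin
of $x$ and $y$ be defined as follows:

(4.5) $\nu_{mn} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y)\,x^m y^n\,dy\,dx$

assuming that the integral exists.
If $m = 0$ and $n = 1$, we have

(4.6) $\nu_{01} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y)\,y\,dy\,dx$

Let $f(x, y)$ be a function in which the order of integration may be interchanged.
Then $\nu_{01}$ becomes

$\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} f(x, y)\,dx\right] y\,dy = \int_{-\infty}^{\infty} h(y)\,y\,dy$

which is the expected value of $y$. Similarly, the expected value of $x$ is

(4.7) $\nu_{10} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y)\,x\,dy\,dx = \int_{-\infty}^{\infty} g(x)\,x\,dx$

We will now define the general product moment about $(\nu_{10}, \nu_{01})$ as follows:

(4.8) $\mu_{mn} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \nu_{10})^m (y - \nu_{01})^n f(x, y)\,dy\,dx$

When $m = n = 1$, we have

(4.9) $\mu_{11} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \nu_{10})(y - \nu_{01}) f(x, y)\,dy\,dx$

which is called the covariance of $x$ and $y$.

When $m = 2$ and $n = 0$, we have the variance of $x$,

(4.10) $\mu_{20} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \nu_{10})^2 f(x, y)\,dy\,dx = \int_{-\infty}^{\infty} (x - \nu_{10})^2 g(x)\,dx = \sigma_x^2$

and similarly the variance of $y$,

(4.11) $\mu_{02} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (y - \nu_{01})^2 f(x, y)\,dy\,dx = \int_{-\infty}^{\infty} (y - \nu_{01})^2 h(y)\,dy = \sigma_y^2$

It is left as an exercise for the student to show that

(4.12) $\mu_{11} = \nu_{11} - \nu_{10}\nu_{01}, \quad \mu_{20} = \nu_{20} - \nu_{10}^2$

The coefficient of correlation between $x$ and $y$, denoted by $\rho_{xy}$, is defined by

(4.13) $\rho_{xy} = \frac{\mu_{11}}{\sigma_x \sigma_y}$

Note that if $\rho_{xy} = 0$, $x$ and $y$ are said to be uncorrelated. If $x$ and $y$ are independent
they are uncorrelated, but the converse is not true. It is possible
for $\rho_{xy}$ to be zero even though $x$ and $y$ are highly dependent on each other.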
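A small discrete illustration of the last remark, chosen by us and not taken from the text: let $x$ take the values $-1, 0, 1$ with equal probability and let $y = x^2$, so that $y$ is completely determined by $x$. The covariance is nevertheless zero. The sketch uses exact rational arithmetic.

```python
from fractions import Fraction

# x takes the values -1, 0, 1 with equal probability and y = x**2.
points = [(-1, 1), (0, 0), (1, 1)]
prob = Fraction(1, 3)

mean_x = sum(prob * x for x, _ in points)
mean_y = sum(prob * y for _, y in points)
covariance = sum(prob * (x - mean_x) * (y - mean_y) for x, y in points)

# Independence would require P(x=0, y=0) == P(x=0) * P(y=0).
p_joint = prob            # the single point (0, 0)
p_product = prob * prob   # P(x=0) = 1/3 and P(y=0) = 1/3
print(covariance, p_joint, p_product)
```

The covariance vanishes, yet the joint probability of $(0, 0)$ differs from the product of the marginal probabilities, so the variables are not independent.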

4.3 Discrete Variables. If the variables $x$ and $y$ are discrete, we define a
joint distribution function $F(x, y)$ by

(4.14) $F(x, y) = \sum_i \sum_j f(x_i, y_j)$

where the summation is over all values of $x_i \le x$ and of $y_j \le y$.

If $x$ is discrete and $y$ continuous,

(4.15) $F(x, y) = \sum_{x_i \le x} \int_{-\infty}^{y} f(x_i, v)\,dv$

and similarly if $y$ is discrete and $x$ continuous,

(4.16) $F(x, y) = \sum_{y_j \le y} \int_{-\infty}^{x} f(u, y_j)\,du$

Finally, if $x$ and $y$ are both continuous,

(4.17) $F(x, y) = \int_{-\infty}^{x}\!\int_{-\infty}^{y} f(u, v)\,dv\,du$

All these cases can be included in one formula by writing the joint function
as a Stieltjes integral (§ 2.3)

(4.18) $F(x, y) = \int_{-\infty}^{x}\!\int_{-\infty}^{y} dF(u, v)$

where

(4.19) $\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} dF(u, v) = 1$

Then

(4.20) $\nu_{mn} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x^m y^n\,dF(x, y)$

(4.21) $\mu_{mn} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \nu_{10})^m (y - \nu_{01})^n\,dF(x, y)$

The variables $x$ and $y$ are independent if and only if $dF(x, y) \equiv dG(x)\,dH(y)$
for all values of $x$ and $y$.


Example 1. If $X$ and $Y$ are independent variables with distribution functions $F(x)$ and
$G(y)$ respectively, find the distribution function of the variable $W = X + Y$.

This distribution function, say $H(w)$, is defined as $H(w) = \Pr[W \le w]$ = probability
that a point in the $XY$-plane lies in the region below the line $X + Y = w$ (shaded in
the accompanying figure). For any given $y$, the probability that $X \le w - y$ is
$F(w - y)$. Hence the required probability is

$H(w) = \int_{-\infty}^{\infty} F(w - y)\,dG(y)$

If $X$ and $Y$ have continuous frequency functions $f(x)$ and $g(y)$ respectively, we find,
by differentiating, that the frequency function of $W$ is

(4.22) $h(w) = \int_{-\infty}^{\infty} f(w - y)\,g(y)\,dy$
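The convolution formula for the frequency function of a sum can be tried on a simple case. The sketch below is our own illustration, not an example from the text: both variables are taken to be uniform on $(0, 1)$, and the integral is estimated by the midpoint rule with an arbitrary step count.

```python
def uniform_density(x: float) -> float:
    """Frequency function of a variable uniformly distributed on (0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def sum_density(w: float, steps: int = 10000) -> float:
    """Midpoint-rule estimate of the integral of f(w - y) g(y) dy with f = g = uniform."""
    width = 1.0 / steps           # g(y) vanishes outside (0, 1)
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * width
        total += uniform_density(w - y) * uniform_density(y)
    return total * width

for w in (0.5, 1.0, 1.5):
    print(w, sum_density(w))
```

The estimates follow the triangular frequency function, which equals $w$ for $0 \le w \le 1$ and $2 - w$ for $1 \le w \le 2$.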


4.4 Joint Distribution of More than Two Variables. The notation and
definitions of § 4.3 may be extended to any finite number of variables, although,
of course, there is no longer an intuitive geometrical picture. If $x_1, x_2, \cdots, x_n$
are the variables, they are independent (see § 1.6) if and only if

(4.23) $dF(x_1, x_2, \cdots, x_n) = dF_1(x_1)\,dF_2(x_2) \cdots dF_n(x_n)$

If all the separate distributions possess frequency functions, and if there is
a joint frequency function, this relation is

(4.24) $f(x_1, x_2, \cdots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n)$

4.5 Some Theorems on Expectation. The expectation of $g(x)$ has been
defined, equation (2.12), as

$E\{g(x)\} = \int_{-\infty}^{\infty} g(x)\,dF(x)$

The following theorems are readily established from this definition, remembering
that

(4.25) $\int_{-\infty}^{\infty} dF(x) = 1$
Theorem 4.1. The expected value of the product of a variable and a constant
is the product of the constant and the expected value of the variable,

(4.26) $E(cx) = cE(x)$

Theorem 4.2. The expected value of the deviation of a variable from its expected
value is zero,

(4.27) $E(x - \nu_1) = 0$

Theorem 4.3. If $x$ and $y$ are variables with a joint distribution function
$F(x, y)$, the expected value of $x + y$ is given by

(4.28) $E(x + y) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x + y)\,dF(x, y)$

$= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x\,dF(x, y) + \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} y\,dF(x, y)$

$= E(x) + E(y)$

That is, the expected value of the sum of two variables is the sum of their expected
values.

This is true whether the variables are independent or not. By an immediate
extension, we have

Theorem 4.4. The expected value of the sum of n variables is the sum of their
expected values,

(4.29) $E(x_1 + x_2 + \cdots + x_n) = \sum_{i=1}^{n} E(x_i)$

Theorem 4.5. If $x$ and $y$ are two independent variables we have, by § 4.3,

(4.30) $E(xy) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} xy\,dF(x, y)$

$= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} xy\,dG(x)\,dH(y)$

$= \int_{-\infty}^{\infty} x\,dG(x) \int_{-\infty}^{\infty} y\,dH(y)$

$= E(x)E(y)$

That is, the expected value of the product of two independent variables is the
product of their expected values.

This theorem can also be extended to $n$ variables.

Theorem 4.6. The expected value of the product of n variables is equal to the
product of their separate expectations, provided the variables are independent,*

$E(x_1 x_2 \cdots x_n) = E(x_1)E(x_2) \cdots E(x_n)$

* See § 1.6.
Theorem 4.7. The expected value of the product of deviations of two independent
variables from their respective expected values is zero,

$E(x - \nu_{10})(y - \nu_{01}) = 0$

This is an immediate corollary from Theorem 4.5.

Theorem 4.8. If $x$ and $y$ are any two variables, the expected value of the product
of deviations from their respective expected values is equal to $\rho_{xy}\sigma_x\sigma_y$, that is,

(4.31) $E(x - \nu_{10})(y - \nu_{01}) = \rho_{xy}\sigma_x\sigma_y$

For, by definition, $\mu_{11} = E(x - \nu_{10})(y - \nu_{01})$ and $\rho_{xy} = \mu_{11}/\sigma_x\sigma_y$.

Theorem 4.9. The variance of the sum of a number of variables is equal to
the sum of the variances, if the variables are pairwise uncorrelated. Thus, if
$y = x_1 + x_2 + \cdots + x_n$ where the variable $x_i$ has the expected value $\nu_i$,

(4.32) $\mathrm{Var}(y) = E\left(y - \sum_i \nu_i\right)^2$

$= E\left[\sum_i (x_i - \nu_i)\right]^2$

$= \sum_i E(x_i - \nu_i)^2 + \sum_{i \ne j} E(x_i - \nu_i)(x_j - \nu_j)$

$= \sum_i \sigma_i^2$

4.6 Moment-generating and Characteristic Functions.* If we put
$g(x) = e^{hx}$ in (2.12) we define a function of $h$,

(4.33) $M(h) = \int_{-\infty}^{\infty} e^{hx}\,dF(x)$

provided that the integral exists. $M(h)$ is called the moment-generating function
(mgf) of the distribution of $x$, since if $e^{hx}$ is expanded in a power series and
if moments of all orders exist,

(4.34) $M(h) = \int_{-\infty}^{\infty} \left(1 + hx + \frac{h^2x^2}{2!} + \cdots\right) dF(x) = 1 + h\nu_1 + \frac{h^2}{2!}\nu_2 + \cdots$

so that the coefficient of $h^r/r!$ is the $r$th moment about zero. It follows that,
if we differentiate $r$ times with respect to $h$ and then set $h = 0$, we obtain the
$r$th moment $\nu_r$, that is,

(4.35) $\nu_r = \left[\frac{d^r M(h)}{dh^r}\right]_{h=0}$

For the Cauchy distribution, as we have seen, $\nu_1$ does not exist and hence
there is no mgf. But we can avoid such difficulties of convergence by supposing
that $h$ is a pure imaginary number, that is, by writing $g(x) = e^{itx}$ where
$t$ is real and $i = \sqrt{-1}$. The function

(4.36) $C(t) = M(it) = \int_{-\infty}^{\infty} e^{itx}\,dF(x)$

always exists for a given distribution since $|e^{itx}| = 1$.
$C(t)$ is called the characteristic function (cf) of the distribution and is, of
course, in general a complex-valued function of $t$. Obviously $C(0) = 1$,
whatever the distribution.

Analogously to (4.34) and (4.35) we have the relations

(4.37) $C(t) = 1 + it\nu_1 - \frac{t^2}{2!}\nu_2 - \frac{it^3}{3!}\nu_3 + \cdots$

and

(4.38) $i^r\nu_r = C^{(r)}(0)$

For the Cauchy distribution,

(4.39) $C(t) = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{e^{itx}}{1 + x^2}\,dx = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{\cos tx}{1 + x^2}\,dx$

since $e^{itx} = \cos tx + i\sin tx$ and the function $(\sin tx)/(1 + x^2)$, being an odd
function of $x$, vanishes on integration between $-\infty$ and $+\infty$. It may be
shown that (4.39) reduces to $e^{-t}$ when $t > 0$ and to $e^{t}$ when $t < 0$. Therefore
$C(t) = e^{-|t|}$ is the cf of the Cauchy distribution. It should be noted, however,
that the first derivative of $C(t)$ is discontinuous at $t = 0$, being $-1$ at $t = 0+$,
and $+1$ at $t = 0-$.

4.7 Some Examples of Moment-generating Functions. (i) The unit distribution.
This is a highly special discrete distribution in which the total
frequency is concentrated at the one point $x = 0$. The distribution function
is

(4.40) $\epsilon(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$

and so has a single step of size 1 at $x = 0$. The mgf is

(4.41) $M(h) = \int_{-\infty}^{\infty} e^{hx}\,d\epsilon(x) = e^0 = 1$

since the integral here reduces to a single term.

(ii) The binomial distribution. Recall that the probability of $x$ successes
in $s$ trials, with constant probability $p$ of success in a single trial, is

(4.42) $p(x, s) = \binom{s}{x} p^x q^{s-x}, \quad x = 0, 1, \cdots, s$

Let us now introduce $s$ auxiliary variables $z_1, z_2, \cdots, z_s$, each of which can take
only the two values 0 and 1, with probabilities $q$ and $p$ respectively. The
number of successes will then be given by

(4.43) $x = z_1 + z_2 + \cdots + z_s$

since each success contributes a 1 and each failure a 0 to this sum.

Each $z_i$ has a distribution function with two steps, one of $q$ at 0 and one of
$p$ at 1. Therefore its mgf is

(4.44) $M_i(h) = \int_{-\infty}^{\infty} e^{hz_i}\,dF(z_i) = e^0 q + e^h p = q + pe^h$

The mgf for $x$ is

$M(h) = \int_{-\infty}^{\infty} e^{hx}\,dF(x) = \int_{-\infty}^{\infty} e^{hz_1}\,dF(z_1) \cdots \int_{-\infty}^{\infty} e^{hz_s}\,dF(z_s) = M_1(h)M_2(h) \cdots M_s(h)$

since the various trials are supposed to be independent (see equation 4.23).
Hence, since $M_i(h)$ is the same for each $i$,

(4.45) $M(h) = (q + pe^h)^s$

From this we can obtain the various moments by means of (4.35). The
student should calculate $\nu_1$ to $\nu_4$ and $\mu_2$ to $\mu_4$ as an exercise.

The expectation of $z_i$ is $0q + 1p = p$. Hence from (4.43) and Theorem 4.4
we obtain immediately $E(x) = sp$.

The variance of $z_i$ is $(0 - p)^2 q + (1 - p)^2 p$, which simplifies into $pq$.
Hence by (4.32) the variance of $x$ is $spq$.
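The results $E(x) = sp$ and $\mathrm{Var}(x) = spq$ can be confirmed by summing directly over the binomial probabilities. The sketch below uses exact fractions; the function name `binomial_moments` and the values $s = 12$, $p = 1/3$ are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_moments(s: int, p: Fraction):
    """Mean and variance of the number of successes, summed over the binomial terms."""
    q = 1 - p
    pmf = [comb(s, x) * p**x * q**(s - x) for x in range(s + 1)]
    mean = sum(x * px for x, px in enumerate(pmf))
    variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
    return mean, variance

s, p = 12, Fraction(1, 3)
mean, variance = binomial_moments(s, p)
print(mean, s * p)                 # direct sum and sp
print(variance, s * p * (1 - p))   # direct sum and spq
```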

(iii) The normal (Gaussian) distribution. In terms of the standardized
variable $\tau$ the frequency function of the normal law is

(4.46) $\phi(\tau) = (2\pi)^{-1/2} e^{-\tau^2/2}$

Therefore the mgf is

(4.47) $M(h) = (2\pi)^{-1/2} \int_{-\infty}^{\infty} e^{h\tau - \tau^2/2}\,d\tau$

Putting $u = \tau - h$, we have

(4.48) $M(h) = (2\pi)^{-1/2} e^{h^2/2} \int_{-\infty}^{\infty} e^{-u^2/2}\,du = e^{h^2/2}$

For all values of $h$ this may be written

$M(h) = 1 + \frac{h^2}{2} + \frac{h^4}{2^2\,2!} + \frac{h^6}{2^3\,3!} + \cdots$

so that all the odd $\nu$'s are 0, and the even ones are given by

(4.49) $\nu_2 = 1$
$\nu_4 = 1 \cdot 3$
$\nu_6 = 1 \cdot 3 \cdot 5$
$\nu_8 = 1 \cdot 3 \cdot 5 \cdot 7$

Because the expected value of $\tau$ is 0, the $\mu$'s coincide with the respective $\nu$'s.
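The even moments can be read off mechanically from the series for $e^{h^2/2}$: the coefficient of $h^{2k}$ is $1/(2^k k!)$, and multiplying by $(2k)!$ gives the moment of order $2k$. A minimal sketch (the function name `normal_moment` is ours):

```python
from math import factorial

def normal_moment(r: int) -> int:
    """r-th moment about zero of the standardized normal, from the series for exp(h^2/2)."""
    if r % 2 == 1:
        return 0                      # no odd powers of h appear in the series
    k = r // 2
    # Coefficient of h^(2k) is 1/(2^k k!); multiply by (2k)! to get the moment.
    return factorial(2 * k) // (2**k * factorial(k))

print([normal_moment(r) for r in range(1, 9)])
```

The even entries are 1, 3, 15 and 105, that is, $1$, $1 \cdot 3$, $1 \cdot 3 \cdot 5$ and $1 \cdot 3 \cdot 5 \cdot 7$.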
(iv) The Poisson distribution. Since the Poisson distribution is obtained
from the binomial distribution by letting $p \to 0$ and $s \to \infty$ while $m = sp$
remains finite, we should expect that the moment-generating function of the
Poisson distribution would be given by

(4.50) $M(h) = \lim_{s \to \infty} \left[1 - \frac{m}{s} + \frac{m}{s}e^h\right]^s = e^{-m(1 - e^h)}$

That this passage to the limit is valid rests on a theorem to be found in
Cramér's Mathematical Methods of Statistics. It states that, if a sequence of
characteristic functions has a limit which, as a function of $t$, is continuous at
$t = 0$, then the corresponding sequence of distribution functions also has a
limit, and this limit is the distribution function of the limiting characteristic
function. Since the characteristic function differs from (4.50) only by having
$it$ instead of $h$, and is clearly continuous at $t = 0$, the conditions of the theorem
are satisfied.
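The passage to the limit can be watched numerically by holding $m = sp$ fixed and letting $s$ grow. The sketch below is illustrative only; the values $h = 0.3$ and $m = 2$ are arbitrary, and the function names are ours.

```python
import math

def binomial_mgf(h: float, m: float, s: int) -> float:
    """(q + p e^h)^s with p = m/s, the expression inside the limit."""
    p = m / s
    return (1 - p + p * math.exp(h)) ** s

def poisson_mgf(h: float, m: float) -> float:
    """Limiting form exp(-m (1 - e^h))."""
    return math.exp(-m * (1 - math.exp(h)))

h, m = 0.3, 2.0
for s in (10, 100, 10000):
    print(s, binomial_mgf(h, m, s), poisson_mgf(h, m))
```

The binomial values approach the Poisson value as $s$ increases.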

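The moments of the Poisson distribution are therefore given by

(4.51) $\nu_1 = \left[\frac{dM}{dh}\right]_{h=0} = m$

$\mu_2 = \nu_2 - \nu_1^2 = m$, etc.

(v) The rectangular distribution

(4.52) $f(x) = \frac{1}{2a}, \quad -a \le x \le a$
$f(x) = 0, \quad |x| > a$

This distribution has discontinuities at $x = \pm a$.

(4.53) $M(h) = \frac{1}{2a}\int_{-a}^{a} e^{hx}\,dx = \frac{1}{ah}\sinh ah = 1 + \frac{a^2h^2}{3!} + \frac{a^4h^4}{5!} + \cdots$

for all real values of $h$. Hence the odd moments are all zero, as is obvious
from the symmetry of the distribution. Also, since $\nu_1 = 0$,

(4.54) $\mu_2 = \frac{a^2}{3}, \quad \mu_4 = \frac{a^4}{5}$, etc.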


4.8 Change of Scale and Origin. If the origin is shifted to the right by an
amount $a$, which is equivalent to replacing $x$ by $x' = x - a$, the distribution
function $F(x)$ becomes $F(x' + a)$. The moment-generating function is now

(4.55) $M_a(h) = \int_{-\infty}^{\infty} e^{hx'}\,dF(x' + a) = e^{-ah}\int_{-\infty}^{\infty} e^{hx}\,dF(x) = e^{-ah}M(h)$

so that the effect is to multiply the moment-generating function by $e^{-ah}$.

If the scale, that is, the unit of $x$, is decreased in the ratio $b:1$, which is
equivalent to replacing the variable $x$ by $x' = bx$, the distribution function
becomes $F(x'/b)$. The mgf is

(4.56) $M_b(h) = \int_{-\infty}^{\infty} e^{hx'}\,dF(x'/b) = \int_{-\infty}^{\infty} e^{hbx}\,dF(x) = M(bh)$

so that the effect is to replace $h$ by $bh$.

Hence for the normal distribution with frequency function

(4.57) $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \mu)^2/2\sigma^2}$

the mgf is

(4.58) $M(h) = e^{\mu h + \sigma^2h^2/2}$

since this distribution is obtained from the standardized normal curve by
shifting the origin a distance $\mu$ to the left and decreasing the unit in the ratio
$\sigma:1$. The characteristic function is

(4.59) $C(t) = e^{i\mu t - \sigma^2t^2/2}$

4.9 Uniqueness Theorem for Characteristic Functions. It may be shown 
(see Cram4r’s Mathemattcal Methods of Statistics^ page 93) that a distribution 
is uniquely determined by its characteristic function. It is not true that a 
distribution is uniquely determined when all its moments are known, although 
examples to the contrary are rather special and complicated. 

An interesting dual relation exists between the frequency function and the
characteristic function. If the frequency function f(x) exists for all x, except
at most at a finite number of points,

(4.60)  $C(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\,dx$

and C(t) tends to zero as t tends to $\pm\infty$.
Conversely,

(4.61)  $f(x) = \dfrac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,C(t)\,dt$

and f(x) is said to be the Fourier transform of C(t).

For example, if $C(t) = e^{-\sigma^2t^2/2}$,

$f(x) = \dfrac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx - \sigma^2t^2/2}\,dt = \dfrac{1}{2\pi}\,e^{-x^2/2\sigma^2}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(\sigma t + ix/\sigma)^2}\,dt$

Putting $w = \sigma t + ix/\sigma$, $dw = \sigma\,dt$, we have

$f(x) = \dfrac{1}{2\pi\sigma}\,e^{-x^2/2\sigma^2}\int_{-\infty + ix/\sigma}^{\infty + ix/\sigma} e^{-w^2/2}\,dw = \dfrac{1}{2\pi\sigma}\,e^{-x^2/2\sigma^2}\lim_{A\to\infty}\int_{-A + ix/\sigma}^{A + ix/\sigma} e^{-w^2/2}\,dw$

Since the integrand has no poles in any finite part of the plane the integral
around a rectangle joining the points $A$, $A + ix/\sigma$, $-A + ix/\sigma$, $-A$ will be
zero. Since also the integrand tends to 0 when $A \to \infty$ along the two ends of the
rectangle, the integral from $-A + ix/\sigma$ to $A + ix/\sigma$ is the same as from $-A$
to $A$, i.e., $\sqrt{2\pi}$ in the limit. Hence

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}$

as we should expect from (4.59) and (4.57).
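
The inversion formula (4.61) can also be checked numerically. The following is only a minimal sketch, assuming a trapezoidal rule on a truncated range of t; the function names, the value sigma = 1.5, and the truncation point are illustrative choices, not part of the text.

```python
import cmath
import math


def invert_characteristic_function(cf, x, t_max=40.0, steps=8000):
    """Approximate f(x) = (1/2pi) * integral of exp(-i t x) C(t) dt on [-t_max, t_max]."""
    dt = 2.0 * t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = -t_max + k * dt
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoidal end weights
        total += weight * (cmath.exp(-1j * t * x) * cf(t)).real
    return total * dt / (2.0 * math.pi)


sigma = 1.5  # assumed value, for illustration only


def normal_cf(t):
    """Characteristic function (4.59) with mean zero."""
    return cmath.exp(-0.5 * (sigma * t) ** 2)


def normal_density(x):
    """Frequency function (4.57) with mean zero."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


for x in (0.0, 1.0, 2.5):
    print(x, invert_characteristic_function(normal_cf, x), normal_density(x))
```

The two printed columns should agree closely, which is the numerical counterpart of the contour-integral argument above.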

4.10 Cumulants and the Cumulant-generating Function. Certain functions
of the moments have been shown by Thiele, Fisher and other writers to be
of particular importance in sampling theory. These functions were first
called semi-invariants and later, by Fisher, cumulants. They are usually
(following Fisher) denoted by the Greek letter kappa ($\kappa$).

If the logarithm of the moment-generating function as a function of h is
expanded into a power series which converges for some range of h containing
the origin as an interior point, then $\kappa_r$ is the coefficient of $h^r/r!$ in this series,
and log M(h) is the cumulant-generating function (cgf) K(h); that is,

(4.62)  $K(h) = \log M(h) = \kappa_1 h + \kappa_2\dfrac{h^2}{2!} + \kappa_3\dfrac{h^3}{3!} + \cdots$

Since M(0) = 1, there is no constant term. If we move the origin a distance
$\nu_1$ to the right, the mgf becomes $M(h)e^{-\nu_1 h} = M_1(h)$, where

$M_1(h) = 1 + \mu_2\dfrac{h^2}{2!} + \mu_3\dfrac{h^3}{3!} + \cdots$

since $\nu_r$ is now the same as $\mu_r$.

Since, with $K_1(h) = \log M_1(h)$,

$\dfrac{dK_1(h)}{dh} = \dfrac{1}{M_1(h)}\,\dfrac{dM_1(h)}{dh}$

we have

(4.63)  $\left(1 + \mu_2\dfrac{h^2}{2!} + \mu_3\dfrac{h^3}{3!} + \cdots\right)\left(\kappa_1 + \kappa_2 h + \kappa_3\dfrac{h^2}{2!} + \kappa_4\dfrac{h^3}{3!} + \cdots\right) = \mu_2 h + \mu_3\dfrac{h^2}{2!} + \mu_4\dfrac{h^3}{3!} + \cdots$

and, on equating coefficients of like powers of h,

(4.64)
$\kappa_1 = 0$
$\kappa_2 = \mu_2$
$\kappa_3 = \mu_3$
$\kappa_4 + 3\mu_2\kappa_2 = \mu_4$
$\kappa_5 + 6\mu_2\kappa_3 + 4\mu_3\kappa_2 = \mu_5$

Solving these equations for the kappas, we get

(4.65)
$\kappa_2 = \mu_2$
$\kappa_3 = \mu_3$
$\kappa_4 = \mu_4 - 3\mu_2^2$
$\kappa_5 = \mu_5 - 10\mu_3\mu_2$

The only effect of the translation on the kappas is to change $\kappa_1$ from $\nu_1$ to 0.
Hence we can complete (4.65) by adding, for the original distribution,

(4.66)  $\kappa_1 = \nu_1$
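
As a quick check of (4.65) and (4.66), the relations can be coded directly. This is only an illustrative sketch: the function name is hypothetical, and the Poisson central moments used as input ($\mu_4 = m + 3m^2$, $\mu_5 = m + 10m^2$) are standard values assumed here rather than taken from this section.

```python
def cumulants_from_central_moments(mean, mu2, mu3, mu4, mu5):
    """Return (k1, k2, k3, k4, k5) from the mean and central moments, per (4.65)-(4.66)."""
    k1 = mean
    k2 = mu2
    k3 = mu3
    k4 = mu4 - 3 * mu2 ** 2
    k5 = mu5 - 10 * mu3 * mu2
    return k1, k2, k3, k4, k5


# Poisson distribution with parameter m (assumed central moments, see lead-in).
m = 2.5
poisson_cumulants = cumulants_from_central_moments(
    mean=m, mu2=m, mu3=m, mu4=m + 3 * m ** 2, mu5=m + 10 * m ** 2
)
print(poisson_cumulants)
```

Every entry of the result should equal m, in agreement with the Poisson example that follows in the text.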

Some examples of cumulants will now be given. 

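Example 2 (Normal distribution).

$K(h) = \log e^{\nu_1 h + \sigma^2h^2/2} = \nu_1 h + \sigma^2\dfrac{h^2}{2}$

$\kappa_1 = \nu_1$
$\kappa_2 = \sigma^2$

and all remaining cumulants are zero.

Example 3 (Poisson distribution).

$M(h) = e^{m(e^h - 1)}$

$K(h) = m(e^h - 1) = m\left(h + \dfrac{h^2}{2!} + \dfrac{h^3}{3!} + \cdots\right)$

Hence each cumulant has the value m.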

4.11 Additive Property of Cumulants. The principal property of cumulants
is expressed in the following theorem:

Theorem 4.10. If $L = \sum_{j=1}^{N} c_j x_j$ is a linear function of N independent variates,
the cumulant-generating function for L is given by

(4.67)  $K(h) = \sum_{j=1}^{N} K_j(c_j h)$

where $K_j(h)$ is the cgf for $x_j$.

This follows at once from the corresponding property for moment-generating
functions (see Problem 6),

(4.68)  $M(h) = \prod_{j=1}^{N} M_j(c_j h)$

when we remember that K(h) = log M(h).

Thus, if x and y are independent variables, the cgf for x + y is the sum of the
cgf for x and y separately.

Some important reproductive properties of certain distributions follow
from Theorem 4.10.


Theorem 4.11. If $x_1, x_2, \ldots, x_N$ are independent and normally distributed
variables with means $m_1, m_2, \ldots, m_N$ and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2$, and if
$L = \sum_{j=1}^{N} c_j x_j$ is a linear function of the $x_j$, then L is normally distributed with
mean* $\sum_{j=1}^{N} c_j m_j$ and variance $\sum_{j=1}^{N} c_j^2\sigma_j^2$.

Proof: The cgf of the variable $x_j$ is

$K_j(h) = m_j h + \sigma_j^2\dfrac{h^2}{2}$

Hence, for L, the cgf is

(4.69)  $K(h) = h\sum_j c_j m_j + \dfrac{h^2}{2}\sum_j c_j^2\sigma_j^2$

Therefore

$\nu_1 = \sum_j c_j m_j$ and $\mu_2 = \sigma^2 = \sum_j c_j^2\sigma_j^2$

all the higher cumulants being zero. Hence, by the uniqueness theorem, the
distribution of L is normal.


Theorem 4.12. If $x_1, x_2, \ldots, x_N$ are independent normally distributed variables
with means $m_1, m_2, \ldots, m_N$ and with a common variance $\sigma^2$, and if
$L = (1/\sqrt{N})\sum_{j=1}^{N} x_j$, then L is normally distributed with the same variance.

This is an immediate corollary of Theorem 4.11, with $c_j = 1/\sqrt{N}$ and
$\sigma_j = \sigma$.

Theorem 4.13. If the independent variables $x_1, x_2, \ldots, x_N$ have binomial
distributions with a common parameter p, but with $s = s_1, s_2, \ldots, s_N$, respectively,
and if $X = \sum_{j=1}^{N} x_j$, then X has a binomial distribution with parameters p and
$S = \sum_{j=1}^{N} s_j$.

Proof:

$K_j(h) = \log (q + pe^h)^{s_j} = s_j \log (q + pe^h)$

by equation (4.45). Then

(4.70)  $K(h) = \sum_j s_j \log (q + pe^h) = S \log (q + pe^h)$

* The word "mean," as mentioned at the end of § 2.5, is often used as synonymous with
"expectation," since the mean of a random sample tends to the expectation as the size of the
sample increases.

Theorem 4.14. If the independent variables $x_1, x_2, \ldots, x_N$ have Poisson
distributions with parameters $m_1, m_2, \ldots, m_N$, and if $X = \sum_{j=1}^{N} x_j$, then X has a Poisson
distribution with parameter $M = \sum_{j=1}^{N} m_j$.

Proof:

$K_j(h) = m_j(e^h - 1)$

Therefore

(4.71)  $K(h) = \sum_j m_j(e^h - 1) = M(e^h - 1)$
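
Theorem 4.14 can be illustrated numerically by convolving two Poisson distributions directly, without using cumulants. This is a minimal sketch; the helper names and the parameter values 1.2 and 3.4 are assumptions made for the example.

```python
import math


def poisson_pmf(k, m):
    """Pr{x = k} for a Poisson variate with parameter m."""
    return math.exp(-m) * m ** k / math.factorial(k)


def pmf_of_sum(k, m1, m2):
    """Pr{x1 + x2 = k} for independent Poisson x1, x2, by direct convolution."""
    return sum(poisson_pmf(j, m1) * poisson_pmf(k - j, m2) for j in range(k + 1))


m1, m2 = 1.2, 3.4  # illustrative parameters
for k in range(6):
    print(k, pmf_of_sum(k, m1, m2), poisson_pmf(k, m1 + m2))
```

The convolved probabilities should match those of a single Poisson distribution with parameter m1 + m2, as the cgf argument predicts.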


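Theorem 4.15. The Poisson distribution tends to the normal distribution
as $m \to \infty$.

Proof: If we change the variable to the standard form $z = (x - m)/\sqrt{m}$, the
effect on the mgf is, by (4.55) and (4.56), to multiply M(h) by $e^{-mh}$ and then
to replace h by $h/\sqrt{m}$.
Hence the mgf is

$M(h) = \exp\left[-h\sqrt{m} + m(e^{h/\sqrt{m}} - 1)\right]$

and the cgf is

$K(h) = -h\sqrt{m} + m(e^{h/\sqrt{m}} - 1) = \dfrac{h^2}{2!} + \sum_{r=3}^{\infty} m^{1 - r/2}\,\dfrac{h^r}{r!}$

Hence $\kappa_1 = 0$, $\kappa_2 = 1$, $\kappa_r = m^{1 - r/2}$ for $r > 2$. The limit of $\kappa_r$ as $m \to \infty$ is
therefore 0 for r > 2, so that the cumulants tend to the values for the normal
curve.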
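
To see how quickly the higher cumulants die out, the expression $\kappa_r = m^{1 - r/2}$ for the standardized Poisson variable can be tabulated. The sketch below is illustrative only; the function name and the chosen values of m are assumptions.

```python
def standardized_poisson_cumulant(r, m):
    """r-th cumulant of (x - m)/sqrt(m) when x is Poisson with parameter m."""
    if r == 1:
        return 0.0
    return m ** (1.0 - r / 2.0)  # equals 1 for r = 2


for m in (1.0, 100.0, 10000.0):
    print(m, [standardized_poisson_cumulant(r, m) for r in range(1, 6)])
```

For r = 3 the cumulant falls off like $m^{-1/2}$ and for r = 4 like $m^{-1}$, so the skewness disappears more slowly than the excess of kurtosis.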

4.12 Sheppard's Corrections. A continuous distribution may be specified
by the frequencies corresponding to each of a set of class-intervals covering the
range of the variable. If moments and cumulants are calculated, assuming
that the frequencies are concentrated at the mid-points of the respective
intervals, there will be, in many cases, an error due to this grouping. In certain
circumstances it is possible, as Sheppard showed, to allow for the error of
grouping. The corrections, as applied to the cumulants, are as follows:

If $\kappa_r'$ is the rth cumulant of the ungrouped distribution, and $\kappa_r$ the rth
cumulant of the grouped distribution with class interval c, the corrected cumulants
are

(4.73)
$\kappa_r' = \kappa_r$, r odd
$\kappa_r' = \kappa_r - \dfrac{B_r}{r}\,c^r$, r even

where $B_r$ is the rth Bernoulli number.

This number is defined as the coefficient of $t^r/r!$ in the expansion of

$t(e^t - 1)^{-1}$

Thus, $B_0 = 1$, $B_1 = -\frac{1}{2}$, $B_2 = \frac{1}{6}$, $B_4 = -\frac{1}{30}$, $B_6 = \frac{1}{42}$, etc., and all the B's
of odd subscript beyond 1 are zero, so that

(4.74)
$\kappa_2' = \kappa_2 - \dfrac{c^2}{12}$
$\kappa_4' = \kappa_4 + \dfrac{c^4}{120}$
etc.
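
The corrections in (4.73) and (4.74) are easy to generate mechanically. The sketch below computes Bernoulli numbers exactly from the standard recurrence $\sum_{j=0}^{n}\binom{n+1}{j}B_j = 0$ for $n \ge 1$ (an assumption of this example, equivalent to the generating-function definition above) and then applies (4.73); the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def bernoulli_numbers(n_max):
    """Return [B_0, ..., B_n_max] as exact fractions, with B_1 = -1/2."""
    b = [Fraction(1)]
    for n in range(1, n_max + 1):
        partial = sum(comb(n + 1, j) * b[j] for j in range(n))
        b.append(-partial / (n + 1))
    return b


def sheppard_corrected_cumulant(r, kappa_grouped, c):
    """Apply (4.73): odd cumulants are unchanged, even ones lose B_r * c**r / r."""
    if r % 2 == 1:
        return kappa_grouped
    b_r = bernoulli_numbers(r)[r]
    return kappa_grouped - b_r * Fraction(c) ** r / r


print(bernoulli_numbers(6))
print(sheppard_corrected_cumulant(2, Fraction(5), 1))  # kappa_2 - c^2/12
print(sheppard_corrected_cumulant(4, Fraction(5), 1))  # kappa_4 + c^4/120
```

Exact rational arithmetic is used so that the coefficients 1/12 and 1/120 of (4.74) appear without rounding.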


For a proof of the validity of these corrections, under the rather restrictive
conditions in which they apply, the reader may refer to Kendall's Advanced
Theory of Statistics, Vol. I.

These conditions are that the frequency function f(x) is continuous,
bounded, and tends monotonically to zero in the directions in which the range
is infinite, that the absolute moment of order r, $\int |x^r|\,f(x)\,dx$, exists, and
that the limit of $x^r\,[d^j f(x)/dx^j]$ is 0 for all j up to and including r when $x \to \pm\infty$.
If the range has finite terminal points, f(x) and its first m derivatives must
vanish there, m being a number (of the order of r) such that the mth derivative
of f(x) multiplied by $c^{m+1}$ is small at all points of the range. This last condition
means that the distribution must have contact of a sufficiently high order with
the x axis at the ends of the range.

Sheppard's corrections apply also in a rather different situation. We think
of the true continuous distribution as stretching over an unknown range, with
an interval-grid superimposed on it in a random manner, and determine the
corrections to be applied to the grouped moments so as to bring them on the
average nearer to the true moments. The condition about high order contact
at the ends of the range is not now essential, but if it is not satisfied (if, for
example, the distribution starts off abruptly at the beginning of the range)
the grouping error may vary a great deal with the random position of the
interval-grid. Sheppard's corrections will still apply on the average but may
not be at all satisfactory in some particular positions of the interval-grid.

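In the following discussion it is assumed that the random distribution of the
interval grid is rectangular. That is, if $x_i$ is the nearest mid-interval point to
any given x, then $x_i = x + \epsilon$, where $\epsilon$ ranges from $-c/2$ to $c/2$ with a
constant probability density, independent of the distribution of x. This is a
special assumption, but it places no limitation on the frequency function of
x itself.

If $K(h)$, $K'(h)$, $K''(h)$ are the cumulant-generating functions for $x_i$, $x$, and
$\epsilon$ respectively,

$K(h) = K'(h) + K''(h)$

Now by (4.53) $K''(h) = \log\sinh(\theta/2) - \log(\theta/2)$, where $\theta = ch$.

By the definition of the Bernoulli numbers $B_r$,

(4.75)  $\sum_{r=0}^{\infty} \dfrac{B_r\theta^r}{r!} = \theta(e^\theta - 1)^{-1} = \dfrac{\theta}{2}\left(\coth\dfrac{\theta}{2} - 1\right)$

so that

$\sum_{r=2}^{\infty} \dfrac{B_r\theta^r}{r!} = \dfrac{\theta}{2}\coth\dfrac{\theta}{2} - 1$

Dividing by $\theta$ we obtain

$\sum_{r=2}^{\infty} \dfrac{B_r\theta^{r-1}}{r!} = \dfrac{1}{2}\coth\dfrac{\theta}{2} - \dfrac{1}{\theta}$

Integrating from 0 to $\theta$, we have

$\sum_{r=2}^{\infty} \dfrac{B_r\theta^r}{r \cdot r!} = \log\sinh\dfrac{\theta}{2} - \log\dfrac{\theta}{2}$

the constant log 2 being introduced to make both sides tend to 0 as $\theta \to 0$.
Hence

(4.76)  $K'(h) = K(h) - \sum_{r=2}^{\infty} \dfrac{B_r c^r h^r}{r \cdot r!}$

whence

$\kappa_r' = \kappa_r - \dfrac{B_r}{r}\,c^r$

as stated above.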

In practice, Sheppard's corrections are usually applied, not to theoretical
distributions, but to sample data. (See, for example, a discussion in Part One
of this work.*) It is by no means certain that the use of these corrections will
improve the estimates of the moments of the parent population, but very
frequently it will do so. Since, for small samples, the sampling errors of the
moments will greatly exceed the corrections, it is not worth while applying
Sheppard's corrections unless the sample consists of at least a few hundred
individuals. Moreover, the corrections should not be used unless the
distribution tails off gradually at both ends.

* Kenney, J. F. and Keeping, E. S., Mathematics of Statistics, Part One, D. Van Nostrand
Co., Inc., New York, 1954.

4.13 Orthogonal Linear Transformations. Let the n variates $x_i$ be
subjected to a linear transformation

(4.77)  $y_i = \sum_{j} c_{ij}x_j \quad (i, j = 1, 2, \ldots, n)$

where $c_{ij}$ are arbitrary constants. If these constants are so chosen that

(4.78)  $\sum_{j=1}^{n} c_{ij}c_{kj} = 1$ if $i = k$, and $= 0$ if $i \ne k$

the transformation is said to be orthogonal.

If the determinant of the coefficients in (4.77) is multiplied by itself (with
rows and columns transposed), it is easily seen* that, because of (4.78),

$\begin{vmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{vmatrix} \cdot \begin{vmatrix} c_{11} & c_{21} & \cdots & c_{n1} \\ c_{12} & c_{22} & \cdots & c_{n2} \\ \vdots & \vdots & & \vdots \\ c_{1n} & c_{2n} & \cdots & c_{nn} \end{vmatrix} = \begin{vmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{vmatrix}$

so that the determinant itself is equal to ±1. By multiplying the determinants
in the reverse order, which will obviously give the same result, we obtain
the relations

(4.79)  $\sum_{k=1}^{n} c_{ki}c_{kj} = 1$ if $i = j$, and $= 0$ if $i \ne j$

The proof requires a knowledge of matrix theory, given in Chapter X.
If C is the matrix of the coefficients and C' its transpose, then by (4.78) C'
is the inverse of C. It follows that CC' = C'C, which means that these
products are equal, element by element.

Hence in the determinant $|c_{ij}|$ the sum of the squares of the elements in any
row or any column is equal to unity, while the sum of the products of
corresponding elements in any two rows or any two columns is equal to zero.

It follows that

(4.80)  $\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} x_i^2$

The Jacobian of the y's with respect to the x's will be the determinant
$|c_{ij}|$. By changing the sign of one of the y's, if necessary, we can ensure that
the determinant is equal to 1, and hence

(4.81)  $\dfrac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} = 1 = \dfrac{\partial(x_1, \ldots, x_n)}{\partial(y_1, \ldots, y_n)}$

Geometrically, this transformation is equivalent to a rotation of the coordinate
axes about the origin. In such a rotation distance from the origin remains
invariant.


Theorem 4.16. If the variates $x_i$ are independently and normally distributed
about zero with the same variance $\sigma^2$, the $y_i$ are also independently and normally
distributed about zero with variance $\sigma^2$.

Proof: The joint frequency function of the $x_i$ is

$f(x_1, x_2, \ldots, x_n)\,dx_1 \cdots dx_n = (2\pi\sigma^2)^{-n/2}\,e^{-\sum x_i^2/2\sigma^2}\,dx_1 \cdots dx_n$

Therefore, by (4.80) and (4.81), the joint frequency function of the $y_i$ is

$g(y_1, y_2, \ldots, y_n)\,dy_1 \cdots dy_n = (2\pi\sigma^2)^{-n/2}\,e^{-\sum y_i^2/2\sigma^2}\,dy_1 \cdots dy_n$

But the right-hand side is the product of n factors like

$(2\pi\sigma^2)^{-1/2}\,e^{-y_i^2/2\sigma^2}\,dy_i$

and so the $y_i$ are independent and normally distributed, with variance $\sigma^2$.

* The determinants are multiplied according to the rule for matrices (§10.6). For fuller
details a book such as Aitken's Determinants and Matrices (Oliver and Boyd) may be
consulted.


Example. The transformation

(4.82)
$y_1 = (x_1 + x_2 + \cdots + x_n)/\sqrt{n}$
$y_2 = (x_1 - x_2)/\sqrt{2}$
$y_3 = (x_1 + x_2 - 2x_3)/\sqrt{6}$
$y_4 = (x_1 + x_2 + x_3 - 3x_4)/\sqrt{12}$
. . .
$y_n = [x_1 + x_2 + \cdots + x_{n-1} - (n - 1)x_n]/\sqrt{n(n - 1)}$

is orthogonal. It is easy to see that in each y the sum of squares of the coefficients is 1, and
that, for any two different y's, the sum of the products of the coefficients pair by pair is equal to 0.
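
The orthogonality of the transformation (4.82) can be confirmed numerically for any chosen n. The following sketch builds the coefficient matrix row by row and forms the row products required by (4.78); the function names and the test vector are illustrative assumptions.

```python
import math


def helmert_matrix(n):
    """Coefficient matrix of (4.82): row 1 is the scaled sum, row k contrasts x_k with x_1..x_{k-1}."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(2, n + 1):
        scale = math.sqrt(k * (k - 1))
        row = [1.0 / scale] * (k - 1) + [-(k - 1) / scale] + [0.0] * (n - k)
        rows.append(row)
    return rows


def row_products(c):
    """Matrix of sums of products of coefficients, row i against row k."""
    n = len(c)
    return [[sum(c[i][j] * c[k][j] for j in range(n)) for k in range(n)] for i in range(n)]


c = helmert_matrix(5)
for row in row_products(c):
    print([round(value, 12) for value in row])

# Sum of squares is preserved, as in (4.80).
x = [0.3, -1.2, 2.0, 0.7, -0.4]
y = [sum(c[i][j] * x[j] for j in range(5)) for i in range(5)]
print(sum(v * v for v in x), sum(v * v for v in y))
```

The printed matrix should be the unit matrix up to rounding, and the two sums of squares should agree.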


4.14 The Bienaymé-Tchebycheff Inequality. Let X be a random variable,*
taking real values $X_1, X_2, \ldots, X_n$ with respective probabilities $p(X_1), p(X_2),
\ldots, p(X_n)$, or, if X is continuous, taking values between X and X + dX with
probability p(X) dX.

What we mean by saying that X is a random variable is merely that we can
associate in some way with every value of X a real, non-negative number
p(X) such that $\int_{-\infty}^{\infty} dP(X) = 1$, where this integral means $\int_{-\infty}^{\infty} p(X)\,dX$,
if X is continuous, or $\sum_i p(X_i)$, if X takes on the discrete values $X_i$.
The expected value of X is

$\nu_X = \int X\,dP(X)$

and its variance is

$\sigma_X^2 = \int (X - \nu_X)^2\,dP(X)$

Let c be any assigned real positive constant. The probability that
$|X - \nu_X| > c$ is $\int_s dP(X)$, integrated over the set s of all values of X satisfying
the given inequality. Clearly

$\sigma_X^2 \ge \int_s (X - \nu_X)^2\,dP(X)$

and we have

$\int_s (X - \nu_X)^2\,dP(X) \ge c^2\int_s dP(X)$

Theorem 4.17.

(4.83)  $\Pr\{|X - \nu_X| > c\} \le \dfrac{\sigma_X^2}{c^2}$,

which is the Bienaymé-Tchebycheff inequality. It asserts that whatever the
nature of the distribution (if it has a finite variance) the probability that a
random value of X will differ from its expected value by as much as $\delta$ times
its standard deviation is not more than $1/\delta^2$.

* The terms "random variable," "chance variable," "chance quantity," "statistical
variable," "stochastic variable," and "variate," are used synonymously in the literature.
Symbols x and X will both be used at times in this book to denote random variables.

4.15 The Weak Law of Large Numbers (Bernoulli's Theorem). If x is the
number of successes in s trials of an event with constant probability p, we have
seen that the expectation of x is sp and its variance is spq. Therefore by
(4.83)

$\Pr\{|x - sp| > c\} \le \dfrac{spq}{c^2} \le \dfrac{s}{4c^2}$

since pq is always $\le \frac{1}{4}$.

Let y = x/s, the relative frequency of success. Then

$\Pr\{|y - p| > c/s\} \le \dfrac{s}{4c^2}$

or, putting t = c/s,

(4.84)  $\Pr\{|y - p| > t\} \le \dfrac{1}{4st^2}$

Hence for any given positive number $\epsilon$, however small, we can always find s so
large that $\Pr\{|y - p| > t\} < \epsilon$, for any t > 0. This is the weak law of
large numbers in one form.

For example, if $\epsilon$ = 0.001 and t = 0.01, we have to find s so that
1/(0.0004s) < 0.001. This means that it suffices to have s > 2,500,000.
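
The arithmetic of this example follows directly from (4.84): the bound $1/(4st^2)$ falls below $\epsilon$ as soon as $s > 1/(4\epsilon t^2)$. A minimal sketch, with hypothetical function names, is given below.

```python
def chebyshev_bound(s, t):
    """Upper bound (4.84) on Pr{|y - p| > t} after s trials."""
    return 1.0 / (4.0 * s * t * t)


def trials_threshold(eps, t):
    """Value of s beyond which the bound (4.84) is smaller than eps."""
    return 1.0 / (4.0 * eps * t * t)


threshold = trials_threshold(eps=0.001, t=0.01)
print(threshold)                               # about 2,500,000
print(chebyshev_bound(2 * threshold, 0.01))    # half of eps
```

Because the inequality holds for every distribution with finite variance, the required s is very conservative; the normal approximation to the binomial would give a much smaller number of trials.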

The theorem may be stated also as follows.

(4.85)  $\lim_{s\to\infty} \Pr\{|x/s - p| > \delta\} = 0$

for any assigned $\delta > 0$, which means that y = x/s tends in probability or
stochastically to the value p as s increases indefinitely. Note that if $A_s$ stands
for the event $|x/s - p| > \delta$ we have proved that the probability of $A_s$ is at
most $(4s\delta^2)^{-1}$, and so tends to 0 as $s \to \infty$. But the probability of this event for
some $s \ge N$ is given by

$\Pr(A_N + A_{N+1} + \cdots) \le \Pr(A_N) + \Pr(A_{N+1}) + \cdots \le \dfrac{1}{4\delta^2}\left(\dfrac{1}{N} + \dfrac{1}{N+1} + \cdots\right)$

and, since the series diverges, this tells us nothing about the probability.
However, there is a stronger form of the law of large numbers.

4.16 The Strong Law of Large Numbers. Let $x_1, x_2, \ldots, x_n$ be independent
random variables with the same distribution, characterized by

$\kappa_1 = a, \quad \kappa_2 = b, \quad \kappa_3 = c, \quad \kappa_4 = d$

If $y = \dfrac{1}{n}\sum_{i=1}^{n} x_i$, we have by Theorem 4.10 that the cgf of y is

(4.86)  $K(h) = \sum_{i=1}^{n} K_i\!\left(\dfrac{h}{n}\right) = nK_i\!\left(\dfrac{h}{n}\right)$

where

$K_i(h) = \kappa_1 h + \kappa_2\dfrac{h^2}{2!} + \cdots$

Hence

(4.87)  $K(h) = ah + \dfrac{b}{n}\,\dfrac{h^2}{2!} + \dfrac{c}{n^2}\,\dfrac{h^3}{3!} + \dfrac{d}{n^3}\,\dfrac{h^4}{4!} + \cdots$

so that the first four cumulants for the distribution of y are $a$, $b/n$, $c/n^2$, $d/n^3$.
It follows that

(4.88)  $E\{(y - a)^4\} = d/n^3 + 3b^2/n^2$

Now if P(y) is the distribution function of y, and g(y) is a given non-negative
function of y,

(4.89)  $E[g(y)] = \int g(y)\,dP(y) \ge \int_{g(y) > \epsilon} g(y)\,dP(y) \ge \epsilon\int_{g(y) > \epsilon} dP(y)$

for arbitrary positive $\epsilon$. Since the last integral in (4.89) is the probability
that $g(y) > \epsilon$, we have

$\Pr\{g(y) > \epsilon\} \le E[g(y)]/\epsilon$

Letting $g(y) = (y - a)^4$, and $\delta = \epsilon^{1/4}$, we have

(4.90)  $\Pr\{|y - a| > \delta\} \le E(y - a)^4/\delta^4 = (d + 3nb^2)/\delta^4n^3$

This is a modification of the Bienaymé-Tchebycheff inequality, which is
generally sharper than the latter, assuming that the fourth moment exists,
because of the higher power of n in the denominator. The Bienaymé-Tchebycheff
inequality states, in this case, that

(4.91)  $\Pr\{|y - a| > \delta\} \le \dfrac{b}{n\delta^2}$

From (4.90),

$\sum_{n=N}^{\infty} \Pr\{|y_n - a| > \delta\} \le \sum_{n=N}^{\infty}\left(\dfrac{k_1}{n^2} + \dfrac{k_2}{n^3}\right)$

where $k_1 = 3b^2/\delta^4$, $k_2 = d/\delta^4$. Both series $\sum n^{-2}$ and $\sum n^{-3}$ converge, so
that

(4.92)  $\lim_{N\to\infty}\sum_{n=N}^{\infty} \Pr\{|y_n - a| > \delta\} = 0$

where

$y_n = \dfrac{1}{n}(x_1 + x_2 + \cdots + x_n)$

This is a stronger form of the law of large numbers, since it states that the
total probability for n = N, or N + 1, or N + 2, and so on indefinitely, tends
to zero as N increases.
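
The difference between the bounds (4.90) and (4.91) is easiest to appreciate with numbers. The sketch below evaluates both bounds and a long partial sum of the series in (4.92); the cumulant values b = 1, d = 3, the tolerance delta = 0.5, and the function names are all assumptions chosen for illustration.

```python
def second_moment_bound(b, n, delta):
    """Bienayme-Tchebycheff bound (4.91) on Pr{|y_n - a| > delta}."""
    return b / (n * delta ** 2)


def fourth_moment_bound(b, d, n, delta):
    """Fourth-moment bound (4.90) on Pr{|y_n - a| > delta}."""
    return (d + 3 * n * b ** 2) / (delta ** 4 * n ** 3)


def tail_sum(b, d, delta, start, terms=20000):
    """Partial sum of the (4.90) bounds for n = start, ..., start + terms - 1."""
    return sum(fourth_moment_bound(b, d, n, delta) for n in range(start, start + terms))


b, d, delta = 1.0, 3.0, 0.5  # illustrative values
for n in (10, 100, 1000):
    print(n, second_moment_bound(b, n, delta), fourth_moment_bound(b, d, n, delta))
for start in (100, 1000, 10000):
    print(start, tail_sum(b, d, delta, start))
```

The bounds from (4.91) decrease like 1/n, so their sum over n diverges, whereas the bounds from (4.90) decrease like $1/n^2$ and their tail sums shrink as the starting index grows. For small n the fourth-moment bound can be the larger of the two.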

Let $A_{nk}$ denote the event "$|y_n - a| > 1/k$," k = 1, 2, 3, ..., and let $B_k$
denote the event "$|y_n - a| > 1/k$ for infinitely many values of n." Note
that the truth of $B_k$ does not imply that this inequality holds for all values of
n, or even for all values greater than some fixed N.

Now, with the notation of § 1.6, the event $A_{n1} + A_{n2}$ means either or both
of the events $A_{n1}$ and $A_{n2}$. Hence the event $B_k$ is certainly included within
the events represented by

$A_{nk} + A_{n+1,k} + A_{n+2,k} + \cdots$

for every value of n.

It follows from one of the fundamental theorems of probability that

$\Pr(B_k) \le \Pr(A_{nk} + A_{n+1,k} + \cdots) \le \sum_{j=n}^{\infty} \Pr(A_{jk})$

By (4.92), $\lim_{n\to\infty}\sum_{j=n}^{\infty} \Pr(A_{jk}) = 0$ for every value of k. It follows that

$\Pr(B_k) = 0$

Now let B denote the event $\lim_{n\to\infty} y_n \ne a$. If this is true, at least one of the
events $B_k$ must be true for some finite k, and we may write

$B = B_1 + B_2 + B_3 + \cdots$

Therefore

$\Pr(B) \le \Pr(B_1) + \Pr(B_2) + \cdots$

and since $\Pr(B_k) = 0$ for every k, it follows that

(4.93)  $\Pr(B) = 0$

In other words, we have the result, due to Kolmogoroff, expressed in the
following theorem:

Theorem 4.18.

(4.94)  $\Pr\{\lim_{n\to\infty} y_n = a\} = 1$

This is the strong law of large numbers. It implies that in almost every
unending sequence of variables $y_1, y_2, \ldots$, where $y_n = (1/n)(x_1 + x_2 + \cdots + x_n)$
and the x's are independent random variables with the same distribution, the
sequence tends to a limit a. It does not imply the strict mathematical
convergence of $y_n$ to a, although such convergence would imply both the strong
law and the weak law.

The probability mentioned in (4.94) is that of a limit, and hence is difficult
to interpret in any elementary sense. The axiomatic approach is perhaps the
only safe one.

It may be noted that the assumption of a finite fourth moment for the
distribution of the variables $x_i$ is not necessary. If the variables are independent
the existence of a first moment alone is sufficient, but a very different proof
is needed in this case.
4.17 The Central Limit Theorem. Let $X_1, X_2, \ldots, X_r, \ldots$ be independent
random variables, all with the same distribution function* F(x). We can
suppose the origin and units chosen so that $E(X_i) = 0$ and $E(X_i^2) = 1$.

* That is, $\Pr\{X_r < x\} = F(x)$.

Let the characteristic function of any $X_i$ be

$C(t) = \int_{-\infty}^{\infty} e^{itx}\,dF(x)$

If all moments of $X_i$ up to the nth exist, that is, if

$\int_{-\infty}^{\infty} |x|^n\,dF(x)$

exists, then C(t) possesses continuous derivatives of all orders up to and
including the nth. Since we are supposing that $E(X_i^2) = 1$, so that n is at
least 2, we may write, by (4.37),

$C(t) = 1 - \dfrac{t^2}{2} + o(t^2)$

where by $o(t^2)$ we mean terms of order less than $t^2$, that is, a function such
that $o(t^2)/t^2 \to 0$ as $t \to 0$. Let

(4.95)  $Z_n = \dfrac{1}{\sqrt{n}}(X_1 + X_2 + \cdots + X_n)$

The characteristic function of $Z_n$ is $\{C(t/\sqrt{n})\}^n = C_n(t)$ say, where

(4.96)  $C_n(t) = \left\{1 - \dfrac{t^2}{2n} + o\!\left(\dfrac{t^2}{n}\right)\right\}^n$

Now for fixed positive t, $\lim_{n\to\infty} (n/t^2)\cdot o(t^2/n) = 0$. Therefore $|2n\,o(t^2/n)| < \epsilon t^2$,
for sufficiently large n, where $\epsilon$ is any positive quantity, as small as we like.
But

$C_n(t) = \left\{1 - \dfrac{t^2 - 2n\,o(t^2/n)}{2n}\right\}^n$

and hence

$\left\{1 - \dfrac{t^2(1 + \epsilon)}{2n}\right\}^n < C_n(t) < \left\{1 - \dfrac{t^2(1 - \epsilon)}{2n}\right\}^n$

Proceeding to the limit as $n \to \infty$,

$e^{-t^2(1+\epsilon)/2} \le \lim_{n\to\infty} C_n(t) \le e^{-t^2(1-\epsilon)/2}$

or

(4.97)  $\lim_{n\to\infty} C_n(t) = e^{-t^2/2}$

Since $e^{-t^2/2}$ is the characteristic function of the standardized normal
distribution, and because of the uniqueness theorem for characteristic functions, we
can state that the limit of the distribution function for $Z_n$ is $\Phi(x)$, the
distribution function of the normal law, that is,

(4.98)  $\lim_{n\to\infty} \Pr\{Z_n < x\} = \Phi(x)$

This is a form of the Central Limit Theorem.
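
The convergence in (4.97) can be watched in a case where C(t) is known in closed form. As an assumed example (not taken from the text), let each $X_i$ take the values +1 and -1 with probability 1/2, so that $E(X_i) = 0$, $E(X_i^2) = 1$ and $C(t) = \cos t$. The sketch below, with a hypothetical function name, compares $C_n(t)$ with $e^{-t^2/2}$.

```python
import math


def cf_standardized_sum(t, n):
    """C_n(t) = {C(t / sqrt(n))}^n for X = +1 or -1 with probability 1/2, where C(t) = cos t."""
    return math.cos(t / math.sqrt(n)) ** n


t = 1.5  # illustrative argument
limit = math.exp(-t * t / 2.0)
for n in (1, 10, 100, 10000, 1000000):
    print(n, cf_standardized_sum(t, n), limit)
```

The gap between the two columns shrinks roughly in proportion to 1/n for this symmetric distribution.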

We can easily remove the restrictions that $E(X_i) = 0$ and $E(X_i^2) = 1$.
If $E(X_i) = \mu$ and $E\{(X_i - \mu)^2\} = \sigma^2$, we simply change the variable to
$X_i' = (X_i - \mu)/\sigma$, and then, from (4.95) and (4.98), we have

$\lim_{n\to\infty} \Pr\left\{\dfrac{\sum X_i - n\mu}{\sigma\sqrt{n}} < x\right\} = \Phi(x)$

that is, if $\sum X_i = n\bar{X}$, we have

Theorem 4.19.

(4.99)  $\lim_{n\to\infty} \Pr\left\{\sqrt{n}\,\dfrac{\bar{X} - \mu}{\sigma} < x\right\} = \Phi(x)$

The conditions as stated above for the Central Limit Theorem, although
sufficient, are not necessary. The theorem has been proved many times,
under conditions of greater and greater generality.

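For instance, it is not necessary that all the $X_i$ should have the same
distribution. If the variance of $X_i$ is $\sigma_i^2$ and if

(4.100)  $s_n^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2$

then, if $E(X_i) = 0$,

(4.101)  $\Pr\left\{\dfrac{X_1 + X_2 + \cdots + X_n}{s_n} < x\right\} \to \Phi(x)$

as $n \to \infty$, provided that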

(4.102)  $s_n \to \infty$ and therefore $\sigma_n/s_n \to 0$ as $n \to \infty$

and that all the $X_i$ are of the same order of magnitude. This is, roughly
speaking, the Lindeberg condition.

Again, if the third moment exists for the distribution of the $X_i$, and if

(4.103)  $\rho_i^3 = \int_{-\infty}^{\infty} |x|^3\,dF_i(x), \qquad \rho^3 = \rho_1^3 + \rho_2^3 + \cdots + \rho_n^3$

then it can be proved that if

(4.104)  $\lim_{n\to\infty} \dfrac{\rho}{s_n} = 0$

the Central Limit Theorem holds. This is the Liapounoff condition.

The theorem has been stated in its most general form by W. Feller, who
has given conditions for the existence of a sequence of constants $a_1, a_2, \ldots$ and
another sequence $b_1, b_2, \ldots$, such that

(4.105)  $\lim_{n\to\infty} \Pr\left\{\dfrac{X_1 + X_2 + \cdots + X_n}{a_n} - b_n < x\right\} = \Phi(x)$

It is not even necessary that the variables $X_1, \ldots, X_n$ should be independent.
A form of the Central Limit Theorem will still hold even if the $X_i$ are dependent,
provided that any two are independent if their subscripts differ by more than
a fixed number m. It is very remarkable what a wide variety of distributions
tend in the limit to the normal form. This fact, together with the
mathematical tractability of the normal law, accounts for the central position of the
normal distribution in mathematical statistics.

Problems 

1. If X and Y are independent variables with frequency functions given respectively 
by 

f{x) = X >0 

giy) = y>0 

prove by means of the law of convolution (equation 4.22) that the frequency function of
X + Y is

h(w) — 

2. If X and Y are variables with a joint frequency function f(x, y), and if U = Y/X,
apply the method of the example in § 4.3 to find the frequency function of U.

Hint.

Pr{U < u} = Pr{Y < uX} if X > 0
or = Pr{Y > uX} if X < 0

Draw a diagram showing the areas in the XY-plane corresponding to U < u, and hence
write down the distribution function of U. Differentiate to get the frequency function.
Show that

$h(u) = \int_{-\infty}^{\infty} f(x, ux)\,|x|\,dx$

3. In Problem 2 suppose that X and Y are independently and uniformly distributed on
the interval (0, 1) so that f(x, y) = 1 everywhere inside a unit square and 0 outside. Prove
that

$h(u) = 0, \quad u < 0$
$h(u) = \frac{1}{2}, \quad 0 < u < 1$
$h(u) = \dfrac{1}{2u^2}, \quad u > 1$

4. In Problem 2 suppose that X and Y are independently and normally distributed about
0 with unit variance. Prove that

$h(u) = \dfrac{1}{\pi(1 + u^2)}$

so that U has a Cauchy distribution.

5. Prove that the characteristic function of the Laplace (double exponential) distribution

$f(x) = \frac{1}{2}e^{-|x|}, \quad -\infty < x < \infty$

is

$C(t) = \dfrac{1}{1 + t^2}$

6. Prove that if $x_1, x_2, \ldots, x_N$ are independent variates, and if $c_1, c_2, \ldots, c_N$ are
arbitrary constants, not all zero, the mgf of $L = c_1x_1 + c_2x_2 + \cdots + c_Nx_N$ is given by

$M(h) = \prod_{j=1}^{N} M_j(c_j h)$

where $M_j(h)$ is the mgf of $x_j$.

Hint. Use § 4.4.

7. The factorial moment of order r of the discrete distribution f(x) is defined by

$\nu_{[r]} = \sum_x x^{[r]} f(x)$

where

$x^{[r]} = x(x - 1)(x - 2) \cdots (x - r + 1)$

Show that the factorial moment-generating function is

$\sum_x (1 + h)^x f(x)$

8. Find the moment-generating function for the triangular distribution defined by

$f(x) = x, \quad 0 \le x \le 1$
$f(x) = 2 - x, \quad 1 \le x \le 2$

Ans. $(e^h - 1)^2/h^2$.

9. Calculate the first four cumulants of the binomial distribution. 

10. If $f(x) = (2/b)(1 - x/b)$, $0 < x < b$, show that $\int_0^b f(x)\,dx = 1$ and prove that the
mgf is

$\sum_{r=0}^{\infty} \dfrac{2\,b^r h^r}{(r + 1)(r + 2)\,r!}$

Hence find the expectation and variance of x. Ans. $b/3$, $b^2/18$.

11. Find the effect on the mgf of the binomial distribution, $(q + pe^h)^s$, of changing the
variable from x to

$\dfrac{x - sp}{\sqrt{spq}}$

12. If $x_1$ and $x_2$ are independent variates, each having a rectangular distribution with
range 1, show that $x_1 + x_2$ has a triangular distribution.

13. Suppose X has a continuous distribution function F(x). What are the distribution
functions of (a) $Y = e^X$, (b) $Y = \sin X$, (c) $Y = F(X)$?

Hint. $\Pr\{e^X < x\} = \Pr\{X < \log x\}$.

Ans. (a) $F(\log x)$, $0 < x < \infty$; 0 for $x \le 0$.

(b) $\sum_{n=-\infty}^{\infty} \left[F(2n\pi + \sin^{-1}x) - F((2n - 1)\pi - \sin^{-1}x)\right]$.

(c) x, $0 \le x \le 1$; 0, $x < 0$; 1, $x > 1$.

14. If F(x) = 0 when x < 0, and $F(x) = 1 - e^{-2x}$ when x > 0, show that M(h) = 2/(2 - h),
h < 2, and find the mean, standard deviation, skewness, and excess of kurtosis of this
distribution. Ans. 1/2, 1/2, 2, 6.

15. Let X be rectangularly distributed on the interval 0 to 1, so that $\Pr\{x_1 < X < x_2\} =
x_2 - x_1$, where $0 \le x_1 < x_2 \le 1$.

Let the decimal expansion of X be $0.a_1a_2a_3\cdots$ and suppose that every terminating
decimal greater than zero is replaced by the equivalent non-terminating one (e.g., 0.5 is replaced
by 0.4999...). The distribution of $a_1$ is a discrete distribution in which each integer value
0 to 9 occurs with probability 1/10, since $a_1 = k$ when $k/10 < X \le (k + 1)/10$.

Prove that (a) the distribution of $a_n$ is the same as the conditional distribution of $a_n$
when $a_1, a_2, \ldots, a_{n-1}$ are fixed.

(b) If $X' = 0.a_1a_3a_5\cdots$, the distribution of X' is rectangular on 0, 1.

(c) If $X'' = 0.a_2a_4a_6\cdots$, the distribution of X'' is the same as that of X', the two are
independent, and the joint distribution is a uniform distribution over the unit square.

16. If $X_1$ and $X_2$ are independently and rectangularly distributed on the interval 0 to 1,
find the distribution function and density function for $Y = X_2/X_1$.

Hint. Draw a diagram, shading the area within the unit square for which $X_2/X_1 < x$.
The distribution function is equal to the shaded area. There are separate cases for x < 1
and x > 1.

17. If $X_1$ and $X_2$ are independently and rectangularly distributed on the interval 0 to 1,
find the distribution function and density function for $Y = \max(X_1, X_2)$, that is, Y is equal
to the greater of $X_1$ and $X_2$.

18. If $X_1$ and $X_2$ are normally and independently distributed with mean 0 and variance 1,
and if

$Y_1 = m_1 + l_{11}X_1 + l_{12}X_2$
$Y_2 = m_2 + l_{21}X_1 + l_{22}X_2$

show that $Y_1$ and $Y_2$ are normally distributed with means $m_1$ and $m_2$, variances $\mu_{20} =
l_{11}^2 + l_{12}^2$, $\mu_{02} = l_{21}^2 + l_{22}^2$ and covariance $\mu_{11} = l_{11}l_{21} + l_{12}l_{22}$.

The joint probability distribution of $Y_1$ and $Y_2$ is called the bivariate normal distribution.
It is characterized by five parameters, $m_1$, $m_2$, $\mu_{20}$, $\mu_{11}$, $\mu_{02}$.

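19. Prove that the density function for the bivariate normal distribution of Problem 18 is

$\dfrac{1}{2\pi\sqrt{\mu_{20}\mu_{02}(1 - \rho^2)}}\exp\left[-\dfrac{1}{2(1 - \rho^2)}\left\{\dfrac{(y_1 - m_1)^2}{\mu_{20}} + \dfrac{(y_2 - m_2)^2}{\mu_{02}} - \dfrac{2\rho^2(y_1 - m_1)(y_2 - m_2)}{\mu_{11}}\right\}\right]$

where $\rho^2 = \mu_{11}^2/\mu_{20}\mu_{02}$.

Hint. The joint density function for $x_1$ and $x_2$ is $\dfrac{1}{2\pi}e^{-(x_1^2 + x_2^2)/2}$.
Put

$y_1 = m_1 + l_{11}x_1 + l_{12}x_2$
$y_2 = m_2 + l_{21}x_1 + l_{22}x_2$

and assume that $\begin{vmatrix} l_{11} & l_{12} \\ l_{21} & l_{22} \end{vmatrix}$ is not zero.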

If this determinant is zero, the distribution is said to be singular. In this case we can
write

$y_1 = m_1 + x_3, \qquad y_2 = m_2 + (l_{21}/l_{11})\,x_3$


where $x_3 = l_{11}x_1 + l_{12}x_2$.

20. Prove that, if $y = \sum_i c_i x_i$ is a linear function of correlated variates which all possess
finite variances $\sigma_i^2$, then the variance of y is given by

$\mathrm{Var}(y) = \sum_i c_i^2\sigma_i^2 + \sum_{i \ne j} c_i c_j \rho_{ij}\sigma_i\sigma_j$

where $\rho_{ij}$ is the coefficient of correlation between $x_i$ and $x_j$. (Cf. Theorem 4.11, which
corresponds to the case $\rho_{ij} = 0$.)

Hint. As in (4.32),

$\mathrm{Var}(y) = E\left\{\left[\sum_i c_i\,(x_i - E(x_i))\right]^2\right\}$

Use Theorem 4.8.

21. Calculate the factorial mgf and the rth factorial moment for the Poisson distribution, 
(See Problem 7.) 

22. A discrete distribution is defined by $f(x) = p(1 - p)^x$, where 0 < p < 1 and
x = 0, 1, 2, 3, .... Calculate the factorial mgf and the rth factorial moment for this
distribution.

23. A sample of N is taken from a population having the distribution defined by
$f(x) = p(1 - p)^x$, 0 < p < 1, x = 0, 1, 2, 3, .... The frequencies of 0, 1, 2, 3, ... in the
sample are $f_0, f_1, f_2, f_3, \ldots$, where $\sum_{i=0}^{\infty} f_i = N$.

Calculate the probability of this observed sample, and find for what value of p this
probability is a maximum. The value so defined is known as a maximum likelihood estimate of p.
It is, of course, a function of the sample frequencies. Ans. $\hat{p} = (1 + \bar{x})^{-1}$, where $\bar{x}$ is the
mean value of x for the sample.

CHAPTER V


THE GAMMA, BETA, AND CHI-SQUARE DISTRIBUTIONS; THE PEARSON
AND GRAM-CHARLIER SYSTEMS OF CURVES; CURVE FITTING


5.1 The Gamma Distribution. A continuous variable x, distributed with
probability density

(5.1)  $f(x) = \dfrac{e^{-x}x^{m-1}}{\Gamma(m)}, \quad 0 < x < \infty, \quad m > 0$

may be called, following Weatherburn, a Gamma variate with parameter m,
or, for short, a $\gamma(m)$ variate.

The curve of f(x) is asymptotic to the x axis, and touches it at x = 0 if
m > 2. This curve belongs to Type III of the Pearson system of curves
which will be described later.

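The rth moment about the origin is given by

(5.2)  $\nu_r = \dfrac{1}{\Gamma(m)}\int_0^{\infty} e^{-x}x^{m+r-1}\,dx = \dfrac{\Gamma(m + r)}{\Gamma(m)} = m(m + 1)\cdots(m + r - 1)$

Hence

$\nu_1 = m$
$\nu_2 = m(m + 1)$, etc.

and

(5.3)
$\mu_2 = m$
$\mu_3 = 2m$
$\mu_4 = 3m^2 + 6m$

Skewness and excess of kurtosis are therefore given by

(5.4)  $\gamma_1 = 2m^{-1/2}, \quad \gamma_2 = 6m^{-1}$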

The mgf is

(5.5)  $M(h) = \dfrac{1}{\Gamma(m)}\int_0^{\infty} e^{hx}e^{-x}x^{m-1}\,dx = \dfrac{1}{\Gamma(m)}\int_0^{\infty} e^{-u}\left(\dfrac{u}{1 - h}\right)^{m-1}\dfrac{du}{1 - h} = (1 - h)^{-m}, \quad h < 1$

Hence the cgf is

(5.6)  $K(h) = -m\log(1 - h) = m\left[h + \dfrac{h^2}{2} + \dfrac{h^3}{3} + \cdots\right]$

for $|h| < 1$.

The rth cumulant is therefore

(5.7)  $\kappa_r = m(r - 1)! = m\Gamma(r)$
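
The values of skewness and excess of kurtosis in (5.4) can be recovered from the cumulants (5.7). The sketch below assumes the usual definitions $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ and $\gamma_2 = \kappa_4/\kappa_2^2$, which are not restated in this section; the function names are hypothetical.

```python
import math


def gamma_cumulant(r, m):
    """r-th cumulant of a Gamma variate with parameter m, from (5.7)."""
    return m * math.factorial(r - 1)


def skewness_and_excess(m):
    """Return (gamma_1, gamma_2) computed from the second, third and fourth cumulants."""
    k2, k3, k4 = (gamma_cumulant(r, m) for r in (2, 3, 4))
    return k3 / k2 ** 1.5, k4 / k2 ** 2


for m in (0.5, 4.0, 25.0):
    print(m, skewness_and_excess(m), (2.0 / math.sqrt(m), 6.0 / m))
```

Both measures tend to zero as m increases, so the Gamma distribution approaches the normal form for large m.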

Theorem 5.1. If $x_1, x_2, \ldots, x_N$ are independent Gamma variates with
parameters $m_1, m_2, \ldots, m_N$, then $X = \sum_{i=1}^{N} x_i$ is a Gamma variate with parameter
$M = \sum_{i=1}^{N} m_i$.

Proof: The cgf of X is

$K(h) = \sum_i K_i(h) = -\sum_i m_i \log(1 - h) = -M\log(1 - h)$

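Theorem 5.2. If x is a normal variate with mean $\nu_1$ and standard deviation
$\sigma$, and if $v = (x - \nu_1)^2/2\sigma^2$, then v is a Gamma variate with parameter $\frac{1}{2}$.

Proof: The frequency function for x is

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \nu_1)^2/2\sigma^2}$

The probability of a value of v between v and v + dv is

$f(v)\,dv = 2f(x)\,dx$

since as x goes from $-\infty$ to $+\infty$, v goes from $\infty$ to 0 and back again from
0 to $\infty$.

Since

$dv = \dfrac{x - \nu_1}{\sigma^2}\,dx = (2v)^{1/2}\sigma^{-1}\,dx$

we have

$f(v) = \dfrac{2}{\sigma\sqrt{2\pi}}\,e^{-v}\,\dfrac{\sigma}{(2v)^{1/2}} = \dfrac{v^{-1/2}e^{-v}}{\Gamma(\frac{1}{2})}$

which is the frequency function of a $\gamma(\frac{1}{2})$ variate.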

5.2 The Beta Distributions. A continuous variable x, distributed with a
probability density

(5.8)    $f(x) = x^{l-1}(1 - x)^{m-1} / B(l, m), \quad 0 < x < 1$

will be called a Beta variate with parameters l and m or, for short, a $\beta(l, m)$
variate.

Since by (3.22),

(5.9)    $B(l, m) = \int_0^\infty x^{l-1}(1 + x)^{-l-m}\,dx$

we can also speak of a continuous variable with probability density

(5.10)   $f(x) = x^{l-1}(1 + x)^{-l-m} / B(l, m), \quad 0 < x < \infty$

as a Beta variate. To distinguish the two cases we will call it a Beta-prime
variate or $\beta'(l, m)$ variate. The curve of f(x) given by (5.8) belongs to Pearson's
Type I, and that given by (5.10) to Type VI.

If l > 2, the curve of (5.8) is tangential to the axis at the origin, and if
m > 2 it is tangential at x = 1. The rth moment about zero is given by

(5.11)   $\nu_r = \frac{1}{B(l, m)} \int_0^1 x^{l+r-1}(1 - x)^{m-1}\,dx = \frac{B(l + r, m)}{B(l, m)} = \frac{l(l + 1)\cdots(l + r - 1)}{(l + m)(l + m + 1)\cdots(l + m + r - 1)}$

Hence

(5.12)   $\nu_1 = \frac{l}{l + m}, \quad \nu_2 = \frac{l(l + 1)}{(l + m)(l + m + 1)}, \quad \mu_2 = \nu_2 - \nu_1^2 = \frac{lm}{(l + m)^2(l + m + 1)}$, etc.


The curve of (5.10) is asymptotic to the x axis, and touches it at x = 0 if
l > 2. The rth moment about zero is given by

(5.13)   $\nu_r = \frac{1}{B(l, m)} \int_0^\infty x^{l+r-1}(1 + x)^{-l-m}\,dx = \frac{B(l + r, m - r)}{B(l, m)} = \frac{l(l + 1)\cdots(l + r - 1)}{(m - 1)(m - 2)\cdots(m - r)}$

if r < m. Therefore

(5.14)   $\nu_1 = \frac{l}{m - 1}, \quad \nu_2 = \frac{l(l + 1)}{(m - 1)(m - 2)}, \quad \mu_2 = \frac{l(l + m - 1)}{(m - 1)^2(m - 2)}$, etc.


The method of proof used in the following two theorems is one that is 
frequently useful in obtaining new distributions. 


Theorem 5.3. If x and y are independent Gamma variates with parameters l
and m respectively, then x/(x + y) is a Beta variate with parameters l, m.

Proof: The joint probability density for x and y is

(5.15)   $f(x, y) = \frac{e^{-x-y} x^{l-1} y^{m-1}}{\Gamma(l)\Gamma(m)}$

where $0 < x < \infty$, $0 < y < \infty$.
We introduce new variables u and v, given by

(5.16)   $u = x + y, \quad v = x/(x + y)$,

or $x = uv$, $y = u(1 - v)$, and find the joint probability density for u and v.
Then by integrating out either variable we have the probability density for
the other one.

The Jacobian of the transformation is

(5.17)   $J = \frac{\partial(x, y)}{\partial(u, v)} = \begin{vmatrix} v & u \\ 1 - v & -u \end{vmatrix} = -u$

Hence $dx\,dy = u\,du\,dv$, and the probability density is given by

         $g(u, v)\,du\,dv = f(x, y)\,dx\,dy = f(x, y)\,u\,du\,dv$

or

(5.18)   $g(u, v) = \frac{1}{\Gamma(l)\Gamma(m)}\, e^{-u} (uv)^{l-1} u^{m-1} (1 - v)^{m-1}\, u = \frac{1}{\Gamma(l)\Gamma(m)}\, e^{-u} u^{l+m-1} v^{l-1} (1 - v)^{m-1}$

The range of u is from 0 to $\infty$ and of v from 0 to 1. The distribution of u is
therefore given by

(5.19)   $f(u) = \int_0^1 g(u, v)\,dv = \frac{e^{-u} u^{l+m-1}}{\Gamma(l)\Gamma(m)}\, B(l, m) = \frac{e^{-u} u^{l+m-1}}{\Gamma(l + m)}$

showing that u is a $\gamma(l + m)$ variate, as already proved in Theorem 5.1.

The distribution of v is given by

         $h(v) = \int_0^\infty g(u, v)\,du = \frac{v^{l-1}(1 - v)^{m-1}}{B(l, m)}$

showing that v is a $\beta(l, m)$ variate.
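Theorem 5.3 can be illustrated by simulation. The sketch below, in Python with the standard library only, draws independent Gamma variates with unit scale, forms v = x/(x + y), and compares the sample mean and variance of v with the values given by (5.12). The parameter values, the number of draws, the seed, and the function name are all our own assumptions; a simulation of this kind is a plausibility check, not a proof.

```python
import random


def simulate_ratio(l: float, m: float, n_draws: int, seed: int = 12345):
    """Draw n_draws values of v = x / (x + y), x ~ gamma(l), y ~ gamma(m), independent."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    values = []
    for _ in range(n_draws):
        x = rng.gammavariate(l, 1.0)  # shape l, scale 1, i.e. density (5.1)
        y = rng.gammavariate(m, 1.0)
        values.append(x / (x + y))
    return values


l, m = 2.0, 3.0  # illustrative parameters
vs = simulate_ratio(l, m, 50_000)

mean_v = sum(vs) / len(vs)
var_v = sum((v - mean_v) ** 2 for v in vs) / len(vs)

# Theoretical values for a beta(l, m) variate from eq. (5.12).
theory_mean = l / (l + m)
theory_var = l * m / ((l + m) ** 2 * (l + m + 1))

print(mean_v, theory_mean)
print(var_v, theory_var)
```

Replacing x/(x + y) by x/y in the loop gives the corresponding check of Theorem 5.4 against (5.14), provided m > 2 so that the variance exists.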

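Theorem 5.4. If x and y are independent Gamma variates with parameters l
and m respectively, then x/y is a Beta-prime variate with parameters l, m.

Proof: Let $u = x + y$, $v = x/y$. Then

(5.20)   $x = \frac{uv}{1 + v}, \quad y = \frac{u}{1 + v}$

Therefore

         $J = \frac{\partial(x, y)}{\partial(u, v)} = \begin{vmatrix} \dfrac{v}{1 + v} & \dfrac{u}{(1 + v)^2} \\ \dfrac{1}{1 + v} & -\dfrac{u}{(1 + v)^2} \end{vmatrix} = -\frac{u}{(1 + v)^2}$

and

         $dx\,dy = \frac{u}{(1 + v)^2}\,du\,dv$

The joint probability density of u and v is

(5.21)   $g(u, v) = \frac{1}{\Gamma(l)\Gamma(m)}\, e^{-u} \left(\frac{uv}{1 + v}\right)^{l-1} \left(\frac{u}{1 + v}\right)^{m-1} \frac{u}{(1 + v)^2} = \frac{1}{\Gamma(l)\Gamma(m)}\, e^{-u} u^{l+m-1} v^{l-1} (1 + v)^{-l-m}$

The range of u is from 0 to $\infty$ and of v from 0 to $\infty$. Hence the probability
density for v is

(5.22)   $h(v) = \int_0^\infty g(u, v)\,du = \frac{v^{l-1}(1 + v)^{-l-m}}{B(l, m)}$

which is the frequency function of a $\beta'(l, m)$ variate.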
5.3 The Chi-square Distribution. Theorem 5.5. If $x_i$, i = 1, 2, ..., n, are
independent, normally distributed variables, with means $\mu_i$ and variances $\sigma_i^2$,
then the quantity defined by

(5.23)   $\tfrac{1}{2}\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{2\sigma_i^2}$

is a Gamma variate with parameter n/2.

Proof: By Theorem 5.2, $\tfrac{1}{2}\chi^2 = \sum_{i=1}^{n} u_i$, where $u_i$ is a $\gamma(\tfrac{1}{2})$ variate. Hence,
by Theorem 5.1, $\tfrac{1}{2}\chi^2$ is a $\gamma(n/2)$ variate.

The probability density is therefore given by

(5.24)   $f(\tfrac{1}{2}\chi^2)\,d(\tfrac{1}{2}\chi^2) = \frac{1}{\Gamma(n/2)}\, e^{-\chi^2/2}\, (\tfrac{1}{2}\chi^2)^{(n/2)-1}\, d(\tfrac{1}{2}\chi^2)$

The rth moment of the distribution of $\chi^2$ is

(5.25)   $\nu_r = 2^r\, \frac{\Gamma(n/2 + r)}{\Gamma(n/2)} = n(n + 2)(n + 4)\cdots(n + 2r - 2)$

since the effect of multiplying the variable by 2 is to multiply the rth moment
about zero by $2^r$.

The rth cumulant is, by (5.7),

(5.26)   $\kappa_r = 2^r\, \Gamma(r)\, \frac{n}{2} = 2^{r-1}(r - 1)!\, n$

Consequently

(5.27)   $\mu_2 = 2n, \quad \mu_3 = 8n, \quad \mu_4 = 12n^2 + 48n$, etc.,

and

(5.28)   $\kappa_1 = n, \quad \kappa_2 = 2n, \quad \kappa_3 = 8n, \quad \kappa_4 = 48n$, etc.

The distribution has a positive skewness $(8/n)^{1/2}$ and a positive kurtosis 12/n,
both tending to zero as n increases. The number n, which is the number of
squares of independent normal standardized variates added to produce $\chi^2$,
is called the number of "degrees of freedom" (§ 7.3).

The chi-square distribution is one of the most important in mathematical
statistics. Some of the reasons for this will appear later. Meanwhile we
note that the chi-square distribution shares the reproductive property of the
Gamma variates.


Theorem 5.6. If x and y are independently distributed as $\chi^2$ with $n_1$ and $n_2$
degrees of freedom respectively, then x + y is distributed as $\chi^2$ with $n_1 + n_2$
degrees of freedom.

This is an immediate corollary of Theorems 5.1 and 5.5. The converse
theorem is often useful, namely,

Theorem 5.7. If the sum of two independent positive variates is a $\chi^2(n_1 + n_2)$
variate, and one of them is a $\chi^2(n_1)$ variate, then the other is a $\chi^2(n_2)$ variate.

Proof: The mgf of a $\gamma(n_1/2)$ variate is $(1 - h)^{-n_1/2}$. Hence if M(h) is the
mgf of the second variate,

         $(1 - h)^{-(n_1 + n_2)/2} = (1 - h)^{-n_1/2}\, M(h)$

whence

         $M(h) = (1 - h)^{-n_2/2}$

Theorem 5.8. For large values of n, $\sqrt{2\chi^2}$ is approximately normally
distributed about $\sqrt{2n}$ with unit variance.

Proof: The mgf of $\chi^2$ is $(1 - 2h)^{-n/2}$. Hence the mgf of the standardized
variable $(\chi^2 - n)/\sqrt{2n}$ is

         $e^{-nh/\sqrt{2n}} \left(1 - \frac{2h}{\sqrt{2n}}\right)^{-n/2}$

For a fixed value of h we can take n so large that this tends to $e^{h^2/2}$, which is the mgf
of the standardized normal law.

Hence $(\chi^2 - n)/(2n)^{1/2}$ tends to a normal distribution with mean zero and
unit variance.

Now the distribution function of $\sqrt{2\chi^2} - \sqrt{2n}$ is equal to the probability
that

         $\sqrt{2\chi^2} - \sqrt{2n} < x$, that is, $\Pr\{\chi^2 < \tfrac{1}{2}[x + (2n)^{1/2}]^2\}$

         $= \Pr\{(\chi^2 - n)/(2n)^{1/2} < x + \tfrac{1}{2}x^2/(2n)^{1/2}\}$

As $n \to \infty$, this tends to the limit of the probability that $(\chi^2 - n)/(2n)^{1/2} < x$,
which is $\Phi(x)$ defined in (2.38), as just proved. Hence $(2\chi^2)^{1/2} - (2n)^{1/2}$ is
approximately a standard normal variate, for large n.

The above investigation does not indicate how good the approximation is
for moderate values of n. Fisher has shown that it is improved by putting
$\sqrt{2n - 1}$ instead of $\sqrt{2n}$. This approximation is often used to calculate $\chi^2$
for values of n larger than about 30.

A still better approximation is that of Wilson and Hilferty, who showed
that $(\chi^2/n)^{1/3}$ is very nearly normally distributed about $1 - 2/9n$ with a
variance of $2/9n$.

Thus, to calculate the 95% point for $\chi^2$, that is, the value of u such that
$\Pr\{\chi^2 < u\} = 0.95$, we write

(5.29)   $\chi^2 = n\left[1 - \frac{2}{9n} + t\left(\frac{2}{9n}\right)^{1/2}\right]^3$

where t = 1.645, corresponding to $\Phi(t) = 0.95$. For n = 30, this gives
$\chi^2 = 43.77$, which is correct. The Fisher approximation is 43.49.
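The two approximations may be compared numerically. The following is a minimal sketch in Python, assuming only the standard library; the function names are ours. The Fisher form used here is obtained by solving $\sqrt{2\chi^2} - \sqrt{2n - 1} = t$ for $\chi^2$, and the "crude" form is the same thing with $\sqrt{2n}$, as in Theorem 5.8.

```python
import math


def chi2_point_wilson_hilferty(n: float, t: float) -> float:
    """Percentage point of chi-square from eq. (5.29)."""
    a = 2.0 / (9.0 * n)
    return n * (1.0 - a + t * math.sqrt(a)) ** 3


def chi2_point_fisher(n: float, t: float) -> float:
    """Fisher's approximation: sqrt(2 chi^2) - sqrt(2n - 1) treated as standard normal."""
    return 0.5 * (t + math.sqrt(2.0 * n - 1.0)) ** 2


def chi2_point_crude(n: float, t: float) -> float:
    """Theorem 5.8 as it stands: sqrt(2 chi^2) - sqrt(2n) treated as standard normal."""
    return 0.5 * (t + math.sqrt(2.0 * n)) ** 2


n, t = 30, 1.645  # the case worked in the text: 95% point, 30 degrees of freedom

wh = chi2_point_wilson_hilferty(n, t)   # about 43.77
fisher = chi2_point_fisher(n, t)        # about 43.49
crude = chi2_point_crude(n, t)

print(round(wh, 2), round(fisher, 2), round(crude, 2))
```

For other percentage points only t need be changed, for example t = 2.326 for the 99% point.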

5.4 Distribution of Sums of Squares. The following theorems are often
useful in determining the frequency function for a variate.

Theorem 5.9 (Fisher's Theorem). Let A be a sum of squares of n independent
normal standardized variates $x_i$, and suppose A = B + C, where B is a
quadratic form in the $x_i$, distributed as $\chi^2$ with h degrees of freedom. Then C is
distributed as $\chi^2$ with n − h degrees of freedom, and is independent of B.

Proof: A is distributed as $\chi^2$ with n df. B is a sum of squares of h orthogonal
linear functions, $y_1, y_2, \ldots, y_h$, of the $x_i$. By § 4.13 we can find n − h further
functions $y_{h+1}, \ldots, y_n$, which are orthogonal among themselves and also to
$y_1, y_2, \ldots, y_h$, and are such that

         $\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} x_i^2$

Hence $A - B = \sum_{i=h+1}^{n} y_i^2$, so that C is the sum of n − h squares of independent
normal variates. It follows that C is distributed as $\chi^2$ with n − h df,
independently of B. The theorem can be extended as follows:

Theorem 5.10. If $A = B_1 + B_2 + \cdots + B_k + C$, where $A = \sum_{i=1}^{n} x_i^2$,
$B_i$ (i = 1, 2, ..., k) is a sum of squares of $n_i$ variates y, which are independent linear
functions of the $x_i$, and $\sum_{i=1}^{k} n_i < n$, then C is distributed as $\chi^2$ with $n - \sum n_i$
degrees of freedom independently of the $B_i$.

A converse of this theorem, given by Cochran, states that, if B and C are
distributed as $\chi^2$ with $n_1$ and $n_2$ df and if A = B + C is distributed as $\chi^2$ with
$n_1 + n_2$ df, then B and C are independent. This is not, in general, true, but
if B and C are not independent they must be related in a special way. If we
think of the joint frequency function $f(x_1, x_2)$ of the variables $x_1$ and $x_2$, where
$0 < x_1 < \infty$ and $0 < x_2 < \infty$, as a distribution of mass over the first quadrant
of the $x_1 x_2$ plane, then it is clearly possible to remove some of this mass in
some places and redistribute it in other places in such a way as to keep both
marginal distributions unaltered and also so as to keep the total mass unaltered
along any line $x_1 + x_2$ = constant. (This last condition requires that if mass
is removed at any point an equal quantity must be deposited at the mirror
image of that point in the diagonal line $x_1 = x_2$, and if mass is deposited
anywhere an equal quantity must be removed at the mirror image.)

If originally $x_1$ and $x_2$ were independently distributed, so that $f(x_1, x_2) =
f_1(x_1) f_2(x_2)$, this would no longer be true after the redistribution, but the
distributions of $x_1$, $x_2$ and $x_1 + x_2$ would all be unaffected. The theorem is
therefore not always true. It is true if B is a sum of squares of independent
linear functions.

Problems 

1. Sketch the curves of the Gamma distributions with values 1, 2, 3, of the parameter.
Since $\Gamma(m) = (m - 1)!$, tables of the Poisson function may conveniently be used here.

2. Prove that the mode of a $\gamma(m)$ variate, m > 1, is at x = m − 1. Hence show that
Pearson's definition of skewness,

         $Sk = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$

gives, for this variate, a value one half that of the moment definition,

         $Sk_M = \gamma_1$

3. Prove that the mode of a $\beta(l, m)$ variate, l > 1, m > 1, is at $x = (l - 1)/(l + m - 2)$,
and the mode of a $\beta'(l, m)$ variate, l > 1, is at $x = (l - 1)/(m + 1)$.

4. Sketch the curves of a /3(f, 2) distribution and of a 1) distribution. 

5. Prove that, if x is a $\beta'(l, m)$ variate, then 1/x is a $\beta'(m, l)$ variate.

6. Prove that if x and y are independent normal standard variates (that is, with mean
zero and variance 1), then z = x/y has a Cauchy distribution. (See Problem 4, Chapter
IV.)

Hint: By Theorem 5.2, $x^2/2$ and $y^2/2$ are $\gamma(\tfrac{1}{2})$ variates. Therefore, by Theorem 5.4,
$z^2$ is a $\beta'(\tfrac{1}{2}, \tfrac{1}{2})$ variate. Hence obtain the distribution of z and show that its frequency function
is $1/[\pi(1 + z^2)]$. Note that z goes from $-\infty$ to $\infty$ but $z^2$ only from 0 to $\infty$.

7. Prove that if x is a $\beta(l, m)$ variate, then (1 − x)/x and x/(1 − x) are $\beta'(m, l)$ and
$\beta'(l, m)$ variates respectively. (See Problem 5.)

8. Prove that if n = 1, the distribution of $\chi$ (not $\chi^2$) is normal.

Hint: In eq. (5.24) note that $d(\chi^2/2) = \chi\,d\chi$, and that as $\chi$ goes from $-\infty$ to $\infty$, $\chi^2$ goes
from $\infty$ to 0 and back again from 0 to $\infty$.

5.5 Pearson System. There are two systems of generalized frequency 
curves in common use: the Pearson system and the Gram-Charlier system. 

During the years 1895-1916 Karl Pearson published papers in which he
showed that a set of frequency curves could be obtained by assigning values
to the parameters in a certain first order differential equation. The Pearson
school claimed that all the different types of frequency distributions that arise
in practical statistics can be represented by the solutions of this equation.

With regard to the genesis of the Pearson system, one point of view is to
regard it as empirical. The differential equation of the normal curve may be
written

(5.30)   $dy/dx = y(m - x)/a$

where a > 0. Pearson generalized this by writing

(5.31)   $dy/dx = y(m - x)/(a + bx + cx^2)$

Among the solutions of (5.31) there are several types of curves, the shapes
depending on the parameters a, b, c, and m. Examples of symmetrical,
skewed, U-shaped and J-shaped curves with finite and infinite range in either
or both directions, are shown in Figure 13.



Fig. 13. Typical Curves of the Pearson System 


The types of curve are distinguished by the values of $b^2/4ac$, the three main
types being

    I.    $b^2/4ac < 0$,  $c_1 < x < c_2$

where $c_1$ and $c_2$ are the roots of $a + bx + cx^2 = 0$,

    IV.   $0 < b^2/4ac < 1$,  $-\infty < x < \infty$

    VI.   $b^2/4ac > 1$,  $c_1 < x < \infty$

where $c_1$ is the larger root of $a + bx + cx^2 = 0$.

There are two special (symmetrical) types, namely,

    II.   $b^2/4ac = 0$,  $c < 0$,  $-c_1 < x < c_1$

where $c_1 = (-a/c)^{1/2}$,

    VII.  $b^2/4ac = 0$,  $c > 0$,  $-\infty < x < \infty$

There are also two transition types, namely,

    III.  $b^2/4ac = \infty$,  $c = 0$,  $c_1 < x < \infty$

where $c_1 = -a/b$. This is intermediate between Types I and VI.

    V.    $b^2/4ac = 1$,  $c_1 < x < \infty$

where $c_1 = -b/2c$. This is intermediate between Types IV and VI.

The normal curve is included in the system as Type 0. For this type,
b and c are both zero, so that $b^2/4ac$ is indeterminate. There are also some
less important particular cases which are sometimes included as additional
types.
If a Pearson curve possesses an ordinary mode, it will be at x = m. The
most useful curves are those for which y vanishes at two values of x, say $c_1$
and $c_2$, where $c_1$ may be $-\infty$ and $c_2$ may be $+\infty$. If also $x^{r+2}y$ vanishes at
both ends of the range, the rth and (r + 1)th moments exist.

The parameters in (5.31) can be expressed in terms of the moments of the
system. Multiplying by $x^r$ and integrating over the range $c_1$ to $c_2$, we have

(5.32)   $\int_{c_1}^{c_2} (ax^r + bx^{r+1} + cx^{r+2})\, \frac{dy}{dx}\,dx = \int_{c_1}^{c_2} y(mx^r - x^{r+1})\,dx$

Integrating the left-hand side by parts, we obtain

         $\left[y(ax^r + bx^{r+1} + cx^{r+2})\right]_{c_1}^{c_2} - \int_{c_1}^{c_2} y\left[arx^{r-1} + b(r + 1)x^r + c(r + 2)x^{r+1}\right] dx$

and by hypothesis the first term vanishes at $c_1$ and $c_2$. Also

(5.33)   $\nu_r = \int_{c_1}^{c_2} yx^r\,dx$

so that

(5.34)   $ar\nu_{r-1} + b(r + 1)\nu_r + c(r + 2)\nu_{r+1} = -m\nu_r + \nu_{r+1}$

Putting r = 0 in (5.34) we get

         $b + 2c\nu_1 = -m + \nu_1$

or

(5.35)   $\nu_1 = (m + b)/(1 - 2c)$

Putting r = 1 we get

         $a + 2b\nu_1 + 3c\nu_2 = -m\nu_1 + \nu_2$

or

(5.36)   $\nu_2 = \frac{a + (m + 2b)\nu_1}{1 - 3c}$

If we suppose the variable changed to the standardized form $t = (x - \nu_1)/\sigma$,
we shall have in terms of the new variable $\nu_1 = 0$, $\nu_2 = \mu_2 = 1$, $\alpha_r = \mu_r = \nu_r$.
Hence $b = -m$, $a = 1 - 3c$, and equation (5.34) becomes

(5.37)   $(1 - 3c)r\alpha_{r-1} - mr\alpha_r + \{c(r + 2) - 1\}\alpha_{r+1} = 0$

Giving r the values 2 and 3, we get

         $2m + (1 - 4c)\alpha_3 = 0$
         $3(1 - 3c) - 3m\alpha_3 - (1 - 5c)\alpha_4 = 0$

whence

(5.38)   $\gamma_1 = \alpha_3 = \frac{2m}{4c - 1}, \quad \gamma_2 = \alpha_4 - 3 = \frac{6(m^2 - 4c^2 + c)}{(4c - 1)(5c - 1)}$

These relations give the skewness and excess of kurtosis of any of the
Pearson curves for which the fourth moment exists. The relations (5.38) may be
expressed in the more convenient forms

(5.39)   $b = -m = \frac{\gamma_1}{2(1 + 2\delta)}, \quad c = \frac{\delta}{2(1 + 2\delta)}, \quad a = 1 - 3c = \frac{2 + \delta}{2(1 + 2\delta)}$

where

         $\delta = (2\gamma_2 - 3\gamma_1^2)/(\gamma_2 + 6)$

so that when the skewness and kurtosis are known the distribution is
determined completely.
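The passage from the skewness and excess of kurtosis to the parameters of the standardized Pearson equation, and back again, can be sketched in a few lines. The following Python fragment is an illustration only, assuming the standardized case ($a = 1 - 3c$, $b = -m$); the function names and the trial values $\gamma_1 = 0.5$, $\gamma_2 = 0.8$ are ours.

```python
def pearson_parameters(gamma1: float, gamma2: float) -> dict:
    """Parameters of the standardized Pearson equation from gamma_1, gamma_2, eq. (5.39)."""
    delta = (2.0 * gamma2 - 3.0 * gamma1 ** 2) / (gamma2 + 6.0)
    denom = 2.0 * (1.0 + 2.0 * delta)
    b = gamma1 / denom
    c = delta / denom
    return {"delta": delta, "m": -b, "a": 1.0 - 3.0 * c, "b": b, "c": c}


def skewness_and_excess(m: float, c: float):
    """gamma_1 and gamma_2 recovered from m and c by eq. (5.38)."""
    gamma1 = 2.0 * m / (4.0 * c - 1.0)
    gamma2 = 6.0 * (m * m - 4.0 * c * c + c) / ((4.0 * c - 1.0) * (5.0 * c - 1.0))
    return gamma1, gamma2


# Trial values (an assumption for illustration, not data from the text).
params = pearson_parameters(0.5, 0.8)
back = skewness_and_excess(params["m"], params["c"])
print(params)
print(back)  # should reproduce (0.5, 0.8) up to rounding

# A Type III case: 2*gamma_2 = 3*gamma_1^2 makes delta and c vanish.
type3 = pearson_parameters(1.0, 1.5)
print(type3)
```

In the Type III case the fragment gives $c = 0$ and $m = -\gamma_1/2$, in agreement with the treatment of that type in § 5.6.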

In Pearson's Tables for Statisticians and Biometricians, Part I, pp. 66 and
67, there are diagrams showing the range of values of $\beta_1$ ($= \gamma_1^2$) and
$\beta_2$ ($= \gamma_2 + 3$) corresponding to the various types. In some regions only
U-shaped or J-shaped curves exist. This diagram is reproduced in extended
form at the beginning of Part II of the Tables.

The Pearson differential equation (5.31) has some theoretical support. If
we think of an urn containing n black and white balls, of which np are white
and nq black, and if we imagine a sample of s balls drawn without replacements,
then according to (2.66) the probability that x of the balls are white is

         $h_n(x, s) = \binom{np}{x}\binom{nq}{s - x} \Big/ \binom{n}{s}$

By representing the successive values of $h_n(x, s)$ for x = 0, 1, 2, ..., s as
ordinates of a frequency polygon, it is possible to show that the slope at the
mid-point of any side, divided by the ordinate at that point, is equal to a
fraction whose numerator is a linear function of x and whose denominator
is a quadratic function. On equating this fraction to $(1/y)(dy/dx)$ we obtain
(5.31). We have already seen (section 2.9) that even the binomial
distribution, when p is not equal to q, approximates to the Pearson Type III form
rather than the normal form, although, of course, the Type III curve tends
to the normal curve as $s \to \infty$.

5.6 Some Pearson Types.

Type I. The differential equation is

(5.40)   $\frac{1}{y}\frac{dy}{dx} = \frac{x - m}{c(x - c_1)(c_2 - x)} = \frac{m_1}{x - c_1} - \frac{m_2}{c_2 - x}$

where

         $m_1 = -\frac{m - c_1}{c(c_2 - c_1)}, \quad m_2 = -\frac{c_2 - m}{c(c_2 - c_1)}$

both $m_1$ and $m_2$ being positive since c is negative.
Integrating, we obtain

(5.41)   $y = A(x - c_1)^{m_1}(c_2 - x)^{m_2}$

If we change the variable to $u = (x - c_1)/(c_2 - c_1)$, so that $1 - u =
(c_2 - x)/(c_2 - c_1)$, we have

(5.42)   $y = Bu^{m_1}(1 - u)^{m_2}$

which shows that u is a Beta variate with parameters $m_1 + 1$, $m_2 + 1$, and
that $1/B = \mathrm{B}(m_1 + 1, m_2 + 1)$, the Beta function.

The curve is tangential to the axis at $x = c_1$ if $m_1 > 1$, perpendicular to the
axis there if $0 < m_1 < 1$ and asymptotic to the line $x = c_1$ if $m_1 < 0$.
Similarly the behavior at $x = c_2$ depends on the value of $m_2$.

Type VI. The differential equation is

(5.43)   $\frac{1}{y}\frac{dy}{dx} = \frac{m - x}{c(x - c_1)(x - c_2)} = \frac{m_1}{x - c_1} - \frac{m_2}{x - c_2}$

where $c_2 < c_1 < x$, $m_1 = (m - c_1)/c(c_1 - c_2)$, $m_2 = (m - c_2)/c(c_1 - c_2)$. Both
$m_1$ and $m_2$ are positive, since c > 0. Hence

(5.44)   $y = A(x - c_1)^{m_1}(x - c_2)^{-m_2}$

Putting $u = (x - c_1)/(c_1 - c_2)$, $1 + u = (x - c_2)/(c_1 - c_2)$, we get

(5.45)   $y = Bu^{m_1}(1 + u)^{-m_2}$

so that u is a Beta-prime variate with parameters $m_1 + 1$ and $m_2 - m_1 - 1$.
Type III. The differential equation is

(5.46)   $\frac{1}{y}\frac{dy}{dx} = \frac{m - x}{b(x - c_1)} = -m_1 + \frac{m_2}{x - c_1}$

where $c_1 < x < \infty$, $m_1 = 1/b$, $m_2 = (m - c_1)/b$. Therefore

(5.47)   $y = Ce^{-m_1 x}(x - c_1)^{m_2}$

Putting $u = m_1(x - c_1)$, we get

(5.48)   $y = Be^{-u}u^{m_2}$

whence u is seen to be a Gamma variate with parameter $m_2 + 1$ and
$\Gamma(m_2 + 1) = 1/B$.

If the variable is standardized, $b = -m$, $a = 1$, and $c_1 = -a/b = 1/m$.
Also by (5.38) the skewness is given by $\gamma_1 = -2m$, so that

         $m_1 = 2/\gamma_1, \quad m_2 + 1 = 4/\gamma_1^2$

Hence the distribution can be expressed in a form in which the skewness
is the only parameter, namely,

(5.49)   $y = K(t + A)^{A^2 - 1}e^{-At}, \quad -A < t < \infty$

where $A = 2/\gamma_1$ and t is the standardized variable.

For a Type III curve the $\delta$ of (5.39) is zero, so that $\gamma_1$ and $\gamma_2$ are connected
by the relation

(5.50)   $2\gamma_2 = 3\gamma_1^2$

The constant K in (5.49) is determined from the condition

(5.51)   $\int_{-A}^{\infty} K(t + A)^{A^2 - 1}e^{-At}\,dt = 1$

Putting $A(t + A) = u$, we get

         $Ke^{A^2}A^{-A^2} \int_0^\infty u^{A^2 - 1}e^{-u}\,du = 1$

whence

(5.52)   $K = A^{A^2}e^{-A^2}/\Gamma(A^2)$

The distribution of 2u is therefore identical with that of $\chi^2$ with $2A^2$ degrees
of freedom. In general, of course, $2A^2$ is not an integer.

The designation "Type III" is usually restricted to the case for which $A^2 \neq 1$.
When $A^2 > 1$, that is, when $|\gamma_1| < 2$, the curve is bell-shaped as shown in
Figure 14.

Fig. 14. Type III Curve when $|\gamma_1| < 2$

In the Pearson system, the distance from the mode to the mean is
$-m = \gamma_1/2(1 + 2\delta)$, and is a measure of skewness. Under the conditions
imposed for Type 0, m = 0. For Type III, however, $m = -\gamma_1/2$ and therefore
we have, for the standardized variable,

         $\text{mean} - \text{mode} = \gamma_1/2$

Because of this relation $\gamma_1/2$ is sometimes used as a measure of skewness in
observed distributions. The curve for $\gamma_1 = -k$ (k = a constant) is a reflection
of that for $\gamma_1 = k$ through the line t = 0.

When $A^2 < 1$, that is, when $|\gamma_1| > 2$, the curve is J-shaped with an infinite
ordinate at $t = -A$. When $A^2 > 2$, the curve is tangential to the t axis at
$t = -A$.

Tables of ordinates and areas of the Type III curve have been published by
Salvosa * in the Annals of Mathematical Statistics, 1, 1930, p. 191.

Type VII. The differential equation is of the form

(5.53)   $\frac{1}{y}\frac{dy}{dx} = \frac{m - x}{c(x^2 + k^2)}, \quad -\infty < x < \infty$

where $k^2 = a/c$. If the variable is standardized, m = b = 0 and $k^2 = 1/c - 3$.
We obtain on integration, calling the standardized variable u,

         $\log y = -\frac{1}{2c}\log(u^2 + k^2) + A$

or

(5.54)   $y = B(u^2 + k^2)^{-(k^2 + 3)/2}$

* These tables have been republished by Edwards Bros., Ann Arbor, Michigan.

This is a symmetrical curve, asymptotic to the x axis in both directions.
Putting $u = kt/(k^2 + 2)^{1/2}$, $n = k^2 + 2$, we find that the distribution of t is
given by

(5.55)   $f(t) = B_1(1 + t^2/n)^{-(n + 1)/2}$

which is a very important distribution in statistics, usually known as Student's
t-distribution. Its genesis will be discussed later in connection with sampling
theory.

A systematic treatment of all the curves in the Pearson system has been
given in a paper by C. C. Craig.

5.7 Gram-Charlier and Edgeworth Series. Distributions are often
encountered which are approximately normal and whose frequency functions
may be represented by a series of terms of which the first corresponds to the
normal law while the others rapidly decrease in importance. We shall assume
that the variable has been standardized, and denote it by $t = (x - \nu_1)/\sigma$.
By repeatedly differentiating the function $e^{-t^2/2}$ we obtain

         $\frac{d}{dt}(e^{-t^2/2}) = -te^{-t^2/2}$

         $\frac{d^2}{dt^2}(e^{-t^2/2}) = (t^2 - 1)e^{-t^2/2}$

and in general

(5.56)   $\frac{d^n}{dt^n}(e^{-t^2/2}) = (-1)^n H_n(t)e^{-t^2/2}$

where $H_n(t)$ is a polynomial in t, of degree n, called the nth Hermite polynomial.
By repeated integration by parts it is easy to show that

(5.57)   $\int_{-\infty}^{\infty} H_m(t)H_n(t)e^{-t^2/2}\,dt = 0 \ (m \neq n), \quad = n!\,(2\pi)^{1/2} \ (m = n)$

Hence if $\phi(t)$ stands for $(2\pi)^{-1/2}e^{-t^2/2}$, and if we assume that a given
frequency function can be expanded in a series

(5.58)   $f(t) = c_0\phi(t) + c_1\phi'(t) + \cdots + c_n\phi^{(n)}(t) + \cdots$

we can formally obtain the constants in the series by means of (5.57).
Multiplying (5.58) by $H_n(t)$ and integrating term by term, we have

(5.59)   $\int_{-\infty}^{\infty} H_n(t)f(t)\,dt = \sum_j c_j \int_{-\infty}^{\infty} H_n(t)\phi^{(j)}(t)\,dt = (-1)^n n!\,c_n$

since all terms in the sum except that for which j = n give zero on integration.
Substituting $H_0 = 1$, $H_1 = t$, $H_2 = t^2 - 1$, $H_3 = t^3 - 3t$, $H_4 = t^4 - 6t^2 + 3$,
we obtain

(5.60)   $c_0 = \int f(t)\,dt = 1$
         $c_1 = -\int tf(t)\,dt = 0$
         $c_2 = \tfrac{1}{2}\int (t^2 - 1)f(t)\,dt = 0$
         $c_3 = -\alpha_3/3! = -\gamma_1/6$
         $c_4 = (\alpha_4 - 6 + 3)/4! = \gamma_2/24$

Therefore

(5.61)   $f(t) = \phi(t) - \frac{\gamma_1}{3!}\phi^{(3)}(t) + \frac{\gamma_2}{4!}\phi^{(4)}(t) + \cdots$


This is the Gram-Charlier A series. It has been shown that the series is
not convergent except under rather restrictive conditions. However, the
important thing is not whether the series converges but whether a few terms
provide a good approximation to f(t). We know that, if a variable x is the
sum of n independent random variables, then, under the conditions of the
Central Limit Theorem, the distribution function of $t = (x - \nu_1)/\sigma$ is for
large n approximately equal to $\Phi(t)$. Also, if the independent variables all
possess continuous frequency functions, the frequency function of x is, under
rather general conditions, approximately equal to $\phi(t)$. The question now is
whether the approximation will be improved by including additional terms
beyond the first in (5.61). It appears that from this point of view the
coefficients in (5.61) do not steadily decrease as the order of the derivative
increases. In fact, while $c_3$, $c_4$, $c_5$ are of orders $n^{-1/2}$, $n^{-1}$, $n^{-3/2}$ respectively,
$c_6$ is again of order $n^{-1}$. Hence if we decide to include the term in $\phi^{(4)}$ we
should also include the term in $\phi^{(6)}$, since it is of the same order of magnitude.

Edgeworth introduced a more satisfactory series, which is a straightforward
expansion in powers of $n^{-1/2}$, and is an asymptotic series with a remainder term
of the same order as the first term neglected. It may be written

(5.62)   $f(t) = \phi(t) - \frac{\gamma_1}{3!}\phi^{(3)}(t) + \left\{\frac{\gamma_2}{4!}\phi^{(4)}(t) + \frac{10\gamma_1^2}{6!}\phi^{(6)}(t)\right\} + \cdots$

The mode of this distribution is at $-\tfrac{1}{2}\gamma_1$ approximately.
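Since by (5.56) $\phi^{(n)}(t) = (-1)^n H_n(t)\phi(t)$, the series (5.61) and (5.62) can be evaluated without tables of derivatives. The sketch below, in Python with the standard library only, generates the Hermite polynomials from the recurrence $H_{k+1}(t) = tH_k(t) - kH_{k-1}(t)$, which is a standard property of these polynomials though it is not stated in the text. The function names and the trial values of $\gamma_1$ and $\gamma_2$ are our own assumptions.

```python
import math


def hermite(n: int, t: float) -> float:
    """Hermite polynomial H_n(t) of eq. (5.56), by H_{k+1} = t H_k - k H_{k-1}."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, t
    for k in range(1, n):
        h_prev, h = h, t * h - k * h_prev
    return h


def phi(t: float) -> float:
    """Standard normal frequency function."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def gram_charlier(t: float, g1: float, g2: float) -> float:
    """Eq. (5.61) truncated after the fourth derivative."""
    return phi(t) * (1.0 + g1 / 6.0 * hermite(3, t) + g2 / 24.0 * hermite(4, t))


def edgeworth(t: float, g1: float, g2: float) -> float:
    """Eq. (5.62) up to and including the bracketed terms."""
    return phi(t) * (1.0 + g1 / 6.0 * hermite(3, t) + g2 / 24.0 * hermite(4, t)
                     + g1 ** 2 / 72.0 * hermite(6, t))


def integrate(func, lo: float, hi: float, steps: int = 4000) -> float:
    """Plain trapezoidal rule, adequate for these smooth, rapidly decaying functions."""
    width = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * width)
    return total * width


g1, g2 = 0.3, 0.2  # trial skewness and excess (illustrative only)
area = integrate(lambda t: edgeworth(t, g1, g2), -10.0, 10.0)
print(hermite(3, 2.0), hermite(4, 2.0), hermite(6, 2.0))
print(area)  # close to 1, since the correction terms integrate to zero
```

It should be borne in mind that a truncated series of this kind is not guaranteed to be positive everywhere; for large skewness the computed ordinate may become negative in the tails.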

Tables giving the area and ordinate of the standardized normal curve, as
well as the derivatives of all orders from the 2nd to the 8th, may be found in
Glover's Tables of Applied Mathematics in Finance, Insurance, Statistics.
Four-figure tables of the 2nd, 3rd, and 4th derivatives are given in the Chemical
Rubber Company's Handbook of Chemistry and Physics.

5.8 Curve Fitting. The attempt is often made to fit one of the known
theoretical distributions to an empirical distribution obtained from a sample.
If the fit is satisfactory, it is a reasonable hypothesis, not disproved by the
data, that the parent population from which the sample was obtained does,
in fact, follow the theoretical distribution in respect of the particular quality
measured.

The usual procedure in fitting is the "method of moments" advocated by
Karl Pearson and his school. The first few moments of the given
distribution are calculated and used to estimate the corresponding moments for the
hypothetical parent population. A suitable type of theoretical curve is
selected on the basis of these moments, and the parameters of the curve are
determined. The curve may then be drawn and compared with the empirical
distribution of the sample. The goodness of fit is judged by means of the
chi-square criterion which will be discussed later, in § 5.14.

In Chapter XII it will be shown that this method of moments is not (except 
for the normal, binomial, and Poisson distributions) the most efficient method 
of estimating the parameters, efficiency being judged by the smallness of the 
sampling variance of the statistics used for estimation. Moreover, the chi- 
square criterion is not strictly appropriate unless the estimation is done by a 
method having maximum efficiency, such as the method of maximum likeli- 
hood. However, for theoretical distributions which do not depart very widely 
from the normal curve, the method of moments is so convenient that it may 
be worth using, even at some sacrifice of efficiency. For a further discussion, 
see §§ 12.5 to 12.7. 

The number of parameters, and hence the number of moments to be cal- 
culated, depends on the type of curve selected. The Poisson curve has only 
one parameter, the normal and binomial curves two, the Pearson Type III 
three, and the main Pearson types four parameters. Four are also required 
for the Edgeworth series if we stop at the third approximation. It is seldom 
worth while going further than this because of the relatively great sampling 
errors in the higher moments. 

In calculating the moments for a grouped distribution Sheppard's correc- 
tions (see § 4.12) may usefully be employed where the sample is at least 400 
or 500 and the distribution tails off gradually at both ends. 

In order to estimate the moments of the parent population we must appeal
to some results from the theory of sampling which will be discussed in Chapter
VII. It may be proved that unbiased estimates of the cumulants $\kappa_1$, $\kappa_2$, $\kappa_3$, $\kappa_4$,
from a sample of size N are provided by certain sample statistics $k_1$, $k_2$, $k_3$, $k_4$,
defined as follows:

(5.63)   $k_1 = \bar{X}$

(5.64)   $k_2 = \frac{N}{N - 1}\,m_2 = s^2$

(5.65)   $k_3 = \frac{N^2}{(N - 1)(N - 2)}\,m_3$

(5.66)   $k_4 = \frac{N^2}{(N - 1)(N - 2)(N - 3)}\left[(N + 1)m_4 - 3(N - 1)m_2^2\right]$

where $\bar{X}$ is the sample mean, s is the sample standard deviation, and
$m_2$, $m_3$, $m_4$ are sample moments about the mean. The k-statistics are said
to provide unbiased estimates of the corresponding kappas because

(5.67)   $E(k_i) = \kappa_i, \quad i = 1, 2, \ldots$

Estimates of $\gamma_1$ and $\gamma_2$ are provided by the statistics $g_1$ and $g_2$, defined by

(5.68)   $g_1 = k_3/k_2^{3/2}$

(5.69)   $g_2 = k_4/k_2^2$

and these are unbiased when the parent population is normal.

Having estimated the cumulants for the parent population, we have to
decide on the type of curve to be fitted. For a normal curve it is necessary
that the skewness and excess should be so near zero that the difference may
reasonably be attributed to sampling errors. For a Poisson curve all the
cumulants must be equal. For a Pearson Type III curve $2\gamma_2 = 3\gamma_1^2$. If
none of these conditions is reasonably well fulfilled, we can try a more general
Pearson type, using the diagram referred to in § 5.5 to determine which type
is appropriate. Alternatively, we can try an Edgeworth series.

In order to judge whether such relations are satisfied with reasonable
probability, we need to know the standard errors of the statistics concerned.
The standard error of any statistic is an approximation for large N to the
square root of the true variance, any population parameters which occur in
the expression for the variance being replaced by the sample estimates. The
standard errors of $\bar{X}$ and s are $s/N^{1/2}$ and $[(m_4 - m_2^2)/4m_2N]^{1/2}$ respectively. Those
of $g_1$ and $g_2$ are about $(6/N)^{1/2}$ and $(24/N)^{1/2}$ respectively (see § 6.10), if the
population is normal, but may be quite different from these values for other
types of population.

5.9 An Example of Curve-fitting. Consider the data in Table 2 giving
weights of 1000 Glasgow school children to the nearest pound.

Table 2. Weights of Glasgow School Children (nearest pound)

    Class (lb)    X (lb)      f        F     Probit
    28-31          31.5       1        1      1.91
    32-35          35.5      14       15      2.83
    36-39          39.5      56       71      3.53
    40-43          43.5     172      243      4.30
    44-47          47.5     245      488      4.97
    48-51          51.5     263      751      5.68
    52-55          55.5     156      907      6.32
    56-59          59.5      67      974      6.94
    60-63          63.5      23      997      7.75
    64-67          67.5       3     1000       —

The class-interval is here 4 lb and the true values of X at the ends of the
intervals are as given in column 2. By the usual methods the values of
the k-statistics, with Sheppard's corrections, are found to be

(5.70)   $k_1 = 47.712$ lb
         $k_2 = 33.342$ lb²
         $k_3 = 22.074$ lb³
         $k_4 = -115.95$ lb⁴

whence

(5.71)   $g_1 = 0.114$
         $g_2 = -0.104$

The standard errors of $g_1$ and $g_2$ are about 0.077 and 0.154 respectively, so
that $g_1$ differs from zero by about 1.5 times its standard error and $g_2$ by about
two-thirds of its standard error. The probabilities of differences as large as this,
assuming the distribution to be normal, are about 0.14 and 0.50 respectively.
Hence we can regard the true skewness and the true kurtosis as both zero.
We take the curve as

(5.72)   $y = \frac{N}{\sigma\sqrt{2\pi}} \exp\{-(X - \mu)^2/2\sigma^2\}$

where N = 1000, $\mu = 47.712$ lb, $\sigma = 5.774$ lb. (The symbols $\mu$ and $\sigma^2$ will
frequently be used for the population mean and variance, respectively,
instead of $\nu_1$ and $\mu_2$.)

In order to plot the curve with frequencies per 4-lb interval as ordinates it
is convenient to write

(5.73)   $y = \frac{4N}{\sigma}\,\phi(t) = 692.8\,\phi(t)$

where

(5.74)   $t = (X - \mu)/\sigma = (X - 47.712)/5.774$

For selected values of t, $\phi(t)$ is found immediately from Table I in the
Appendix, and the corresponding values of X are calculated from (5.74).
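The leading figures of this example can be reproduced from Table 2. The following Python sketch, using the standard library only, computes the mean and the second k-statistic from the grouped frequencies and then forms $g_1$, $g_2$, their approximate standard errors, and the factor in (5.73). Two points are assumptions on our part: the order in which Sheppard's correction ($h^2/12$ subtracted from $m_2$) and the factor N/(N − 1) are applied, and the use of the simple large-sample forms $(6/N)^{1/2}$ and $(24/N)^{1/2}$ of § 5.8 for the standard errors. The variable names are ours.

```python
import math

# Table 2: upper class boundaries (column 2) and class frequencies (column 3).
upper = [31.5, 35.5, 39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5, 67.5]
freq = [1, 14, 56, 172, 245, 263, 156, 67, 23, 3]
h = 4.0  # class interval in lb

N = sum(freq)
mids = [u - h / 2.0 for u in upper]  # class mid-points 29.5, 33.5, ...

mean = sum(f * x for f, x in zip(freq, mids)) / N
m2 = sum(f * (x - mean) ** 2 for f, x in zip(freq, mids)) / N

# Assumed order of operations: Sheppard's correction first, then N / (N - 1).
m2_corrected = m2 - h * h / 12.0
k2 = N / (N - 1) * m2_corrected

# Values of the k-statistics as given in (5.70).
k2_text, k3_text, k4_text = 33.342, 22.074, -115.95
g1 = k3_text / k2_text ** 1.5
g2 = k4_text / k2_text ** 2

# Large-sample standard errors for a normal parent population.
se_g1 = math.sqrt(6.0 / N)
se_g2 = math.sqrt(24.0 / N)

sigma = math.sqrt(k2_text)
factor = h * N / sigma  # multiplier of phi(t) in (5.73)

print(round(mean, 3), round(k2, 3))
print(round(g1, 3), round(g2, 3), round(se_g1, 3), round(se_g2, 3))
print(round(factor, 1))
```

The mean agrees with $k_1$ exactly and $k_2$ to within about 0.001; small differences in the last figure of the other quantities are to be expected from rounding and from the approximate standard errors used here.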

5.10 Approximate Tests of Normality. A rough test of the normality of the 
distribution may be made by plotting the percentage cumulative frequency 
lOOF/X against X on special “probability paper.^' In Table 2, column 4, 
the values of F are given corresponding to Xe*. These are to be divided by 
10 to give percentages, since X is here 1000. 

On ordinary graph paper a smooth curve drawn between the plotted points 
will approximate in shape, if the distribution is nearly normal, the ogive curve 
of 4>(x), the distribution function of the normal law. On probability paper 
the scale of percentage cumulative frequency is so drawn out at both ends and 
compressed in the middle that the curve of 4>(a:), or rather 100#(a;), becomes a 
straight line. Hence if the graph of lOOF/X is nearly straight, the distribu- 
tion is approximately normal. 

Instead of using probability paper, we may achieve the same result by
plotting “probits” corresponding to $F$ on ordinary paper. A convenient
table of probits may be found in Statistical Tables by Fisher and Yates (Table
IX). In Table 2 above, the probits corresponding to the values of $F/10$ are
given, and when these are plotted against $X$ the points lie reasonably well on
a straight line (Fig. 15). It may be noted that the graph gives rough estimates
of the mean and standard deviation of the parent distribution, since probit 5
corresponds to $\mu$ and probit 6 to $\mu + \sigma$.



The customary method with ungrouped data, when these are not too
numerous, is to arrange the items in ascending order of $X$ and to suppose that
the percentage cumulative frequency corresponding to the $k$th item out of a
total of $n$ is $100(k - \tfrac12)/n$. The $k$th item is, as it were, split in two, half
going with the preceding $k - 1$ items to make a total cumulated frequency
of $k - \tfrac12$.
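As a hedged sketch of this rule for ungrouped data, the following Python computes the percentage $100(k - \tfrac12)/n$ and the matching probit for each ordered item. The probit is taken as the normal deviate plus 5, which agrees with the statement that probit 5 corresponds to the mean and probit 6 to one standard deviation above it. The function name and the sample values are hypothetical.

```python
from statistics import NormalDist

def probit_points(values):
    """Return (x, percent, probit) for each item of an ungrouped sample."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()            # mean 0, standard deviation 1
    points = []
    for k, x in enumerate(xs, start=1):
        p = (k - 0.5) / n                # cumulative proportion for the k-th item
        probit = 5.0 + std_normal.inv_cdf(p)
        points.append((x, 100.0 * p, probit))
    return points

sample = [3.1, 2.4, 4.0, 2.9, 3.6]       # hypothetical observations
points = probit_points(sample)
```

Plotting the first entry of each tuple against the third gives the probit graph; a nearly straight line suggests approximate normality.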

5.11 The Multinomial Distribution. We consider a sample of $N$ individuals
grouped in $k$ classes, the respective class frequencies being $f_1, f_2, \dots, f_k$, where
$\sum_{i=1}^{k} f_i = N$. Following Karl Pearson, we regard this sample as a random
sample of a hypothetical parent population in which the probability of
belonging to the $i$th class is $p_i$, $i = 1, 2, \dots, k$. The joint distribution of the class
frequencies is then the multinomial distribution.

Consider an event that is characterized by a variable $v$ which can take on
one of $k$ values, $v_1, v_2, \dots, v_k$. Let the probability that $v_i$ occurs be $p_i$, where
$\sum_{i=1}^{k} p_i = 1$. Then in $N$ independent trials, the probability that $v_1$ occurs $m_1$
times, $v_2$ occurs $m_2$ times, and so on, in a specified order (whatever it may be) is

    $p_1^{m_1} p_2^{m_2} \cdots p_k^{m_k}$

where $\sum_{i=1}^{k} m_i = N$, the $m_i$ being positive integers or zero. The number of
ways in which the order can be specified is the number of permutations possible
among $N$ objects of which $m_1$ are of type $T_1$, $m_2$ of type $T_2$, $\dots$, $m_k$ of type
$T_k$. Let this number be denoted by $P[m_i]$. Then we have

    $P[m_i] = \dfrac{N!}{m_1!\, m_2! \cdots m_k!}$

Therefore, the probability that $m_1$ of the variates take the value $v_1$, $m_2$ the
value $v_2$, and so on, regardless of order is

(5.75)    $f(m_1, m_2, \dots, m_k) = P[m_i]\, p_1^{m_1} p_2^{m_2} \cdots p_k^{m_k}$

which is the general term of the expansion of the multinomial

    $(p_1 + p_2 + \cdots + p_k)^N$

The binomial law, for a simple dichotomy, given in Chapter II, is a special
case of this law. Thus if $k = 2$, the right member of (5.75) reduces to

(5.76)    $\binom{N}{r} p^r q^{N-r}$

where $r = m_1$, $N - r = m_2$, $p = p_1$, $q = p_2$, and $P[m_i] = \binom{N}{r} = N!/m_1!\, m_2!$.

If $v$ is the number of spots appearing on the top face in a throw of a die, then
$v$ will take on one of the values 1, 2, 3, 4, 5, 6, and the probability of throwing
exactly $r$ aces (say) in $N$ throws of the die is

    $\binom{N}{r} \left(\tfrac16\right)^r \left(\tfrac56\right)^{N-r}$
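The general term (5.75) and its binomial special case can be checked numerically. The sketch below is illustrative only; the function names are introduced here, and the example values (two aces in six throws) are chosen arbitrarily.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """General term (5.75): N!/(m_1! ... m_k!) * p_1^m_1 * ... * p_k^m_k."""
    n = sum(counts)
    coefficient = factorial(n)
    for m in counts:
        coefficient //= factorial(m)     # each partial quotient is an integer
    value = float(coefficient)
    for m, p in zip(counts, probs):
        value *= p ** m
    return value

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Two aces in six throws of a die: "ace" against "not ace" is the case k = 2.
two_class = multinomial_pmf([2, 4], [1 / 6, 5 / 6])
one_of_each = multinomial_pmf([1, 1, 1], [1 / 3, 1 / 3, 1 / 3])
```

With two classes the multinomial term agrees with the binomial term, as the text states.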



We recall that (5.76) is the general term of the expansion of the binomial
$(q + p)^N$. By using Stirling's approximation for factorials, we can derive
an approximation for (5.75) which will bear to the multinomial law a relation
analogous to that which the normal curve bears to the binomial. With this
objective in mind, assume that every $m_i$ is sufficiently large for $m_i!$ to be
replaced by its Stirling approximation. Making these replacements, (5.75)
becomes, after some algebraic rearrangement,

(5.77)    $f(m_1, m_2, \dots, m_k) = (2\pi N)^{-(k-1)/2} (p_1 p_2 \cdots p_k)^{-1/2} \prod_{i=1}^{k} \left(\dfrac{Np_i}{m_i}\right)^{m_i + 1/2}$

Next introduce the transformation

(5.78)    $t_i = \dfrac{m_i - Np_i}{\sigma_i}$

$\sigma_i^2$ being $Np_i(1 - p_i)$. Under this transformation (5.77) becomes

    $(2\pi N)^{(k-1)/2} (p_1 p_2 \cdots p_k)^{1/2} f = \prod_{i=1}^{k} \left(1 + \dfrac{\sigma_i t_i}{Np_i}\right)^{-(Np_i + \sigma_i t_i + 1/2)}$

Then

    $\log \text{L.M.} = -\sum (Np_i + \sigma_i t_i + \tfrac12) \log\left(1 + \dfrac{\sigma_i t_i}{Np_i}\right)$

where L.M. denotes the left-hand member of the preceding equation. Upon
expanding the logarithm in a power series and collecting the results according
to descending powers of $N$, we obtain

    $\log \text{L.M.} = -\sum \sigma_i t_i - \tfrac12 \sum \left(\dfrac{\sigma_i^2 t_i^2}{Np_i} + \dfrac{\sigma_i t_i}{Np_i}\right) + \text{terms of lower order}$

From (5.78), $\sum \sigma_i t_i = \sum m_i - N \sum p_i = 0$, since $\sum m_i = N$ and $\sum p_i = 1$.
Therefore, on substituting for $\sigma_i^2$, we have

    $f(m_1, m_2, \dots, m_k) \approx (2\pi N)^{-(k-1)/2} (p_1 p_2 \cdots p_k)^{-1/2} e^{-K/2}$

where

    $K = \sum t_i^2 (1 - p_i) + \sum t_i \left[\dfrac{1 - p_i}{Np_i}\right]^{1/2}$

For large $N$ the second term in $K$ is negligible, and moreover, since some
of the $t_i$ will be positive and some negative, it will tend to cancel out on
summation. Hence $K \to \sum t_i^2 (1 - p_i)$ as $N$ increases. The form of $K$ suggests
the substitution of a new variable $x_i = t_i (1 - p_i)^{1/2}$ in place of $t_i$. This gives

(5.79)    $f(m_1, m_2, \dots, m_k) \approx (2\pi N)^{-(k-1)/2} (p_1 p_2 \cdots p_k)^{-1/2} e^{-\frac12 \sum x_i^2}$

where $x_i = (m_i - Np_i)(Np_i)^{-1/2}$ and $Np_i$ is the expected value of $m_i$. The
multinomial distribution, therefore, tends to a joint normal distribution.
The $x_i$ are not independent, however, since $\sum x_i p_i^{1/2} = 0$. We prove in the
next section that $\sum x_i^2$ has, in the limit as $N \to \infty$, the $\chi^2$ distribution of § 5.3
with $k - 1$ degrees of freedom.

5.12 Chi-square as a Measure of Sample Deviation. On the hypothesis
that the probabilities corresponding to the various classes in a distribution are
$p_1, p_2, \dots, p_k$, the expected frequencies in a sample of $N$ are $Np_1, Np_2, \dots, Np_k$.
Hence the deviations of the sample frequencies from expectation are $m_1 - Np_1$,
$\dots$, $m_k - Np_k$. The quantity

(5.80)    $\chi_s^2 = \sum_{i=1}^{k} (m_i - Np_i)^2 / Np_i = \sum_{i=1}^{k} x_i^2$

is, therefore, a measure of the total deviation of the sample from expectation,
and was so chosen by Karl Pearson, who proved that the limiting distribution
of $\chi_s^2$ is the ordinary $\chi^2$ distribution.

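The moment-generating function of the binomial distribution is, as shown
in § 4.7, $(q + pe^{u})^N$. For the multinomial distribution one can show, as a
generalization of this, that the mgf of the joint distribution of the variables
$m_1, m_2, \dots, m_k$ is

(5.81)    $M(u_1, u_2, \dots, u_k) = (p_1 e^{u_1} + p_2 e^{u_2} + \cdots + p_k e^{u_k})^N$

The mgf of the variables $x_1, x_2, \dots, x_k$ is therefore, by § 4.8,

(5.82)    $M_1(u_1, u_2, \dots, u_k) = \exp\left[-\sum u_i (Np_i)^{1/2}\right] \left[\sum p_i \exp\{u_i (Np_i)^{-1/2}\}\right]^N$

The cumulant generating function is

(5.83)    $K(u_1, u_2, \dots, u_k) = \log M_1(u_1, u_2, \dots, u_k)$

    $= -\sum u_i (Np_i)^{1/2} + N \log \sum p_i \exp\{u_i (Np_i)^{-1/2}\}$

    $= -\sum u_i (Np_i)^{1/2} + N \log \sum p_i \left[1 + u_i (Np_i)^{-1/2} + \tfrac12 u_i^2 (Np_i)^{-1} + \cdots\right]$

    $= -\sum u_i (Np_i)^{1/2} + N \log \left[1 + N^{-1/2} \sum u_i p_i^{1/2} + (2N)^{-1} \sum u_i^2 + O(N^{-3/2})\right]$ *

    $= \tfrac12 \sum u_i^2 - \tfrac12 \left(\sum u_i p_i^{1/2}\right)^2 + O(N^{-1/2})$

    $= \tfrac12 Q(u_1, u_2, \dots, u_k) + O(N^{-1/2})$

where

(5.84)    $Q(u_1, u_2, \dots, u_k) = \sum u_i^2 - \left(\sum u_i p_i^{1/2}\right)^2$

This is a quadratic function of the variables $u_1, u_2, \dots, u_k$.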

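From (5.83) we can prove that for any value of $N$

(5.85)    $E(x_i) = [\partial K/\partial u_i]_{u_1 = \cdots = u_k = 0} = 0$

(5.86)    $E(x_i^2) = \mathrm{Var}(x_i) = [\partial^2 K/\partial u_i^2]_{u_1 = \cdots = u_k = 0} = 1 - p_i$

(5.87)    $E(x_i x_j) = \mathrm{Cov}(x_i, x_j) = [\partial^2 K/\partial u_i\, \partial u_j]_{u_1 = \cdots = u_k = 0} = -(p_i p_j)^{1/2}$

so that, since $\chi_s^2 = \sum_{i=1}^{k} x_i^2$,

(5.88)    $E(\chi_s^2) = \sum_{i=1}^{k} (1 - p_i) = k - 1$

Moreover in the limit as $N \to \infty$,

(5.89)    $K(u_1, u_2, \dots, u_k) \to \tfrac12 Q(u_1, u_2, \dots, u_k)$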

so that all cumulants of order higher than the second vanish. 
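Because (5.85) to (5.88) hold for any $N$, they can be verified exactly for a small case by enumerating every multinomial outcome. The following is a minimal sketch under assumed values ($N = 6$ and class probabilities 0.5, 0.3, 0.2, both chosen only for illustration); the function names are not from the text. The expected covariance $-(p_i p_j)^{1/2}$ is the standard multinomial result $\mathrm{Cov}(m_i, m_j) = -Np_i p_j$ expressed in the standardized variables.

```python
from itertools import product
from math import factorial, sqrt

def multinomial_outcomes(n, probs):
    """Yield (counts, probability) for every outcome of n trials."""
    k = len(probs)
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue
        coef = factorial(n)
        prob = 1.0
        for m, p in zip(counts, probs):
            coef //= factorial(m)
            prob *= p ** m
        yield counts, coef * prob

def standardized_moments(n, probs):
    """Exact E(x_i), E(x_i x_j), E(chi_s^2) with x_i = (m_i - N p_i)/sqrt(N p_i)."""
    k = len(probs)
    mean = [0.0] * k
    second = [[0.0] * k for _ in range(k)]
    chi_mean = 0.0
    for counts, pr in multinomial_outcomes(n, probs):
        x = [(m - n * p) / sqrt(n * p) for m, p in zip(counts, probs)]
        for i in range(k):
            mean[i] += pr * x[i]
            for j in range(k):
                second[i][j] += pr * x[i] * x[j]
        chi_mean += pr * sum(v * v for v in x)
    return mean, second, chi_mean

probs = [0.5, 0.3, 0.2]                  # illustrative class probabilities
mean, second, chi_mean = standardized_moments(6, probs)
```

Here `second[i][i]` should equal $1 - p_i$ and `chi_mean` should equal $k - 1 = 2$, whatever value of $N$ is used.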

Hence the variables $x_i$ are in the limit normally distributed about zero with
variance $1 - p_i$. They are not, however, independent, since

(5.90)    $\sum_{i} x_i p_i^{1/2} = 0$

* $O(N^{-3/2})$ means terms of the order of $N^{-3/2}$.

Let us now make an orthogonal transformation of the variables from
$x_1, x_2, \dots, x_k$ to $y_1, y_2, \dots, y_k$, defined by

(5.91)    $y_1 = c_{11} x_1 + c_{12} x_2 + \cdots + c_{1k} x_k$
          $\cdots$
          $y_k = c_{k1} x_1 + c_{k2} x_2 + \cdots + c_{kk} x_k$

where we choose $c_{ki} = p_i^{1/2}$, so that $y_k = 0$.

This transformation (see § 4.13) is such that

(5.92)    $\sum_{l=1}^{k} c_{il} c_{jl} = 1$ if $i = j$, and $= 0$ if $i \neq j$

Then $y_1^2 + y_2^2 + \cdots + y_k^2 = x_1^2 + x_2^2 + \cdots + x_k^2$ or, since $y_k = 0$,

(5.93)    $\sum_{i=1}^{k} x_i^2 = \sum_{i=1}^{k-1} y_i^2$

The mgf of the joint distribution of the $y_i$ is given by

(5.94)    $M(t_1, \dots, t_k) = \int e^{y_1 t_1 + \cdots + y_k t_k}\, dF(y_1, \dots, y_k)$

Using (5.91) and the fact that the Jacobian of this transformation is 1,
we find

(5.95)    $M(t_1, \dots, t_k) = \int e^{x_1 u_1 + \cdots + x_k u_k}\, dF(x_1, x_2, \dots, x_k)$

where

(5.96)    $u_1 = c_{11} t_1 + \cdots + c_{k1} t_k$
          $u_2 = c_{12} t_1 + \cdots + c_{k2} t_k$
          $\cdots$
          $u_k = c_{1k} t_1 + \cdots + c_{kk} t_k$

The right-hand side of (5.95) is the joint mgf for the variables $x_1, x_2, \dots, x_k$,
considered as a function of $u_1, u_2, \dots, u_k$. Hence in the limit when $N \to \infty$,

    $M(t_1, t_2, \dots, t_k) \to e^{\frac12 Q(u_1, u_2, \dots, u_k)}$

Now

(5.97)    $Q(u_1, u_2, \dots, u_k) = \sum u_i^2 - \left(\sum u_i p_i^{1/2}\right)^2$

    $= \sum t_i^2 - \left(\sum u_i c_{ki}\right)^2$

    $= \sum_{i=1}^{k} t_i^2 - t_k^2$

as is seen by multiplying the rows of (5.96) by $c_{k1}, c_{k2}, \dots, c_{kk}$, respectively,
adding, and using (5.92). Therefore

(5.98)    $M(t_1, t_2, \dots, t_k) \to \exp\left[\tfrac12 \sum_{i=1}^{k-1} t_i^2\right]$

which shows that the variables $y_1, y_2, \dots, y_{k-1}$ are independently and normally
distributed with zero means and unit variances, while $y_k = 0$.

Hence, since $\chi_s^2$ is a sum of squares of these variables, $\chi_s^2$ as defined by
(5.80) has in the limit as $N \to \infty$ the chi-square distribution described in
§ 5.3, with $k - 1$ degrees of freedom.
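The orthogonal transformation used in this argument can be made concrete. The sketch below builds, by Gram-Schmidt orthogonalization (one possible construction, not necessarily the one intended in § 4.13), an orthogonal matrix whose last row is $(p_1^{1/2}, \dots, p_k^{1/2})$, and applies it to the standardized deviations of a hypothetical sample. The counts and probabilities are illustrative assumptions.

```python
from math import sqrt

def orthogonal_with_last_row(last_row):
    """Gram-Schmidt: an orthogonal matrix whose final row is the unit vector last_row."""
    k = len(last_row)
    basis = [list(last_row)]
    for axis in range(k):
        v = [1.0 if j == axis else 0.0 for j in range(k)]
        for b in basis:
            dot = sum(vj * bj for vj, bj in zip(v, b))
            v = [vj - dot * bj for vj, bj in zip(v, b)]
        norm = sqrt(sum(vj * vj for vj in v))
        if norm > 1e-9:                  # skip an axis already spanned by the basis
            basis.append([vj / norm for vj in v])
        if len(basis) == k:
            break
    return basis[1:] + basis[:1]         # move sqrt(p) to the k-th row

probs = [0.5, 0.3, 0.2]                  # hypothetical class probabilities
N = 100
counts = [46, 35, 19]                    # hypothetical sample frequencies
x = [(m - N * p) / sqrt(N * p) for m, p in zip(counts, probs)]
C = orthogonal_with_last_row([sqrt(p) for p in probs])
y = [sum(c * xi for c, xi in zip(row, x)) for row in C]
chi_sq = sum(xi * xi for xi in x)
```

The last transformed variable `y[-1]` vanishes, and the squares of the remaining $k - 1$ variables add up to `chi_sq`, which is the content of (5.93).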
5.13 The Chi-square Test of Hypotheses. Let $H$ stand for the hypothesis
that a sample of $N$ individuals forms a random sample from a population with
a given probability distribution; that is, the parameters of the distribution are
known or assumed and are not estimated from the sample itself. We calculate
$\chi_s^2$ from (5.80) and determine the probability of obtaining, on the hypothesis
$H$, a value of $\chi_s^2$ at least as great as this. This probability, on the assumption
of the limiting chi-square distribution, is given by

(5.99)    $\Pr[\chi^2 > \chi_s^2] = \int_{\chi_s^2}^{\infty} f(\chi^2)\, d(\chi^2)$

where, by (5.24),

(5.100)    $f(\chi^2) = \dfrac{(\chi^2)^{(k-3)/2}\, e^{-\chi^2/2}}{2^{(k-1)/2}\, \Gamma\left(\frac{k-1}{2}\right)}$

The integral in (5.99) is readily expressed as an incomplete Gamma function,
by the substitution $\tfrac12 \chi^2 = u$. In fact,

(5.101)    $\Pr[\chi^2 > \chi_s^2] = 1 - \dfrac{\Gamma_{\chi_s^2/2}\left(\frac{k-1}{2}\right)}{\Gamma\left(\frac{k-1}{2}\right)} = 1 - I\left(\dfrac{\chi_s^2}{\sqrt{2(k-1)}},\ \dfrac{k-3}{2}\right)$


so that the probability can be calculated from tables of the incomplete Gamma
function for given values of $k - 1 = n$. Separate tables of $\chi^2$ have been
calculated and may be found in Part I of Pearson's Tables for Statisticians and
Biometricians, particularly Table XII. The $n'$ of this table is our $k$, which
is one more than the number of degrees of freedom. Tables of $\chi^2$ for one
degree of freedom (not given in Pearson's tables) are found in the Appendix to
Yule and Kendall's Theory of Statistics, Tables 4A, 4B.

It is often convenient to arrange a table of $\chi^2$ as was done by R. A. Fisher
(see Table III of the Appendix to this book) with values of $\chi^2$ corresponding
to selected values of the probability. A more complete table of this type is
given by Thompson in Biometrika, 32, 1941, p. 187. Thus for a probability
of 5% with 10 degrees of freedom, the tabular value of $\chi^2$ is 18.31.

If the sample value $\chi_s^2$ exceeds this when $k = 11$, and if the sample number
$N$ is so large that the distribution of $\chi_s^2$ is practically identical with that of $\chi^2$,
the chance is less than 5% that if hypothesis $H$ were true a random sample
would give a value of $\chi_s^2$ as great as or greater than the one actually obtained.
In other words, if we agree to work on the 5% level of significance, we shall
reject the hypothesis. To be still safer, of course, we could work on the 1%
level, and then we should reject $H$ only if $\chi_s^2 > 23.21$ for $k = 11$.
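Where tables are not at hand, the tail probability can be computed directly. The sketch below evaluates $\Pr[\chi^2 > \chi_s^2]$ from the standard power series for the incomplete Gamma function; this series is a substitute for the tabulated function referred to in the text, and the function name `chi_square_tail` is introduced here. The series is adequate for the moderate arguments met in these examples.

```python
import math

def chi_square_tail(chi_sq, dof):
    """Pr[chi^2 > chi_sq] for dof degrees of freedom."""
    if chi_sq <= 0:
        return 1.0
    a = 0.5 * dof                        # shape parameter of the Gamma integral
    x = 0.5 * chi_sq                     # the substitution u = chi^2 / 2
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:          # sum x^n / (a (a+1) ... (a+n))
        n += 1
        term *= x / (a + n)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return 1.0 - lower

p_5 = chi_square_tail(18.31, 10)         # tabular 5% point for 10 degrees of freedom
p_1 = chi_square_tail(23.21, 10)         # tabular 1% point for 10 degrees of freedom
```

The two tabular values quoted above for 10 degrees of freedom return probabilities very close to 0.05 and 0.01.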

If, on the other hand, $\chi_s^2$ is less than the tabular value corresponding to the
appropriate $k$ and the assigned level of significance, we can say that our
sample is consistent with hypothesis $H$. This does not mean that $H$ is true;
merely that the sample supplies no evidence against it.

There is, of course, a possibility that we shall reject $H$ when it is really true.
The probability of doing this is the same as that of getting by chance a value
of $\chi_s^2 > \chi_p^2$, where $\chi_p^2$ is the tabular value corresponding to $p$%, and hence
this probability is equal to $p/100$. We shall, therefore, commit this kind of
error (known as the first kind) in about $p$% of all cases in which we apply the
criterion. This is why we speak of working on the $p$% level of significance.

It occasionally happens that the value of $\chi_s^2$ from the sample is unexpectedly
small, corresponding to a probability of nearly 1. In such a case the fit is too
good, and it is highly likely that even if the hypothesis $H$ were true we should
get discrepancies greater than those actually observed. We may well suspect
that our sample is not a truly random sample from the hypothetical
population.

With regard to levels of significance, Fisher⁷ says:

In preparing this table we have borne in mind that in practice we do not want to
know the exact value of P for any observed $\chi^2$, but, in the first place, whether or not
the observed value is open to suspicion. If P is between .1 and .9 there is certainly
no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated
that the hypothesis fails to account for the whole of the facts. We shall not often
be astray if we draw a conventional line at .05, and consider that higher values of
$\chi^2$ indicate a real discrepancy.

Just as the binomial distribution tends much more slowly to the normal
distribution when $p$ is very small than when it is around 0.5, so we may expect
that the distribution of $\chi_s^2$ will deviate appreciably from the limiting
chi-square distribution if the expected numbers in some of the classes are small.
This will often happen at the tails of the distribution, so that it is the usual
practice to pool small adjacent classes until no class has fewer than five
individuals in it.

Cramér⁸ recommends pooling until the expected number in any class is at
least 10. On the other hand, Cochran⁹ has shown that, in some cases at least,
numbers as small as 1 may be permitted without seriously affecting the validity
of the test. It seems that pooling tends to increase the calculated probability,
and hence to diminish the chance of rejecting the hypothesis (see Problem 7
below).

A convenient chart of $\chi^2$ and $P$ for various values of $n$ is given in the
Appendix to Yule and Kendall's textbook. Another useful chart has been prepared
by C. I. Bliss.¹⁰

5.14 The Chi-square Test Applied to Curve Fitting. In most cases arising
in practice, the hypothesis that we wish to test is that our sample has been
drawn from a parent population of a certain type (e.g., a normal distribution),
but with parameters that are not specified. In fact, we use the sample itself
to estimate these parameters.

Let us suppose, for convenience, that there are two such parameters, $\theta_1$
and $\theta_2$ (the argument works equally well for any number). The probability
$p_i$ that a single item in the sample falls in the $i$th class is a function of $\theta_1$ and $\theta_2$,
say $p_i(\theta_1, \theta_2)$. If we knew the true values of $\theta_1$ and $\theta_2$, we could calculate

(5.102)    $\chi_s^2 = \sum \dfrac{[m_i - Np_i(\theta_1, \theta_2)]^2}{Np_i(\theta_1, \theta_2)}$

and apply the ordinary $\chi^2$ test. Actually, we replace $\theta_1$ and $\theta_2$ by their
estimates $\hat\theta_1$ and $\hat\theta_2$, but this means that the $p_i$ depend on the sample values, and
we cannot assume that the limiting distribution of $\chi_s^2$ is still the $\chi^2$
distribution. It may be shown, however (as, for example, by Cramér¹¹), that for an
important class of estimates known as most-efficient estimates (see § 12.3) the
limiting distribution as $N \to \infty$ is the same as that obtainable by making $\chi_s^2$
a minimum with respect to $\theta_1$ and $\theta_2$, and in fact is the $\chi^2$ distribution with
$k - 3$ degrees of freedom ($k - s - 1$ if there are $s$ parameters to be estimated
from the sample). This seems reasonable, since each additional parameter
introduces a further restriction on the $x_i$ analogous to (5.90), and so reduces
by one the number of degrees of freedom. The analogy is not exact, however,
and the foregoing statement does not constitute a proof.

As mentioned in § 5.8 the estimates given by the method of moments are
not, in general, most-efficient (see § 12.6). The $\chi^2$ test for curves fitted by
moments may, therefore, be unreliable except in special cases.

In fitting a continuous curve to a grouped distribution it is necessary to
calculate the area under the curve corresponding to each class interval.
This area is equal, in the standardized curve, to the difference of the values of
the distribution function at the beginning and end of the interval, and is
multiplied by the total frequency $N$ to get the calculated frequency $Np_i$ in the
interval.

Since the signs of the differences $m_i - Np_i$ are ignored in the calculation of
$\chi_s^2$, it may happen that a very improbable distribution of signs may still give
a value of $\chi_s^2$ too small to reject the hypothesis $H$. Runs of one sign are
more likely to occur near the tails of the distribution and may be hidden by
pooling.

It is recommended that the student read “The $\chi^2$-Test of Significance” by
T. C. Fry, Jour. Amer. Stat. Assoc., 33, 1938, pp. 513-525. The three papers
following Fry's exposition are also worth reading.

For the application of $\chi^2$ to contingency tables, see Chapter VIII.

Example 1. Twelve dice were thrown 4096 times; only a throw of six was counted a
success. The expected frequencies are given by $4096\left(\tfrac56 + \tfrac16\right)^{12}$. How improbable, taken as a
whole, is the observed distribution shown in Table 3? The symbol $\tilde m$ is used for the expected
frequencies $Np_i$.

Table 3

Number of       Observed        Theoretical
successes       frequency $m$   frequency $\tilde m$   $(m - \tilde m)^2$   $(m - \tilde m)^2/\tilde m$

0                  447             459             144          .3137
1                 1145            1103            1764         1.5993
2                 1181            1213            1024          .8442
3                  796             809             169          .2089
4                  380             364             256          .7033
5                  115             116               1          .0086
6                   24              27               9          .3333
7 and over           8               5               9         1.8000

Totals            4096            4096             $\chi_s^2 = 5.8113$

Entering Table III (see Appendix) with $n = 8 - 1 = 7$, and interpolating for the value of
$P$ corresponding to the observed value of $\chi_s^2 = 5.8113$, we find $P = .56$. Hence there is no
reason to reject the hypothesis that the underlying chance of a “success” is $p = \tfrac16$. That
is, there is no reason to suspect that the dice were biased.
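The arithmetic of Example 1 can be reproduced as follows. This is a sketch only: `tabulated` holds the rounded expectations printed in Table 3, while `expected_frequencies` recomputes them from the binomial $4096(\tfrac56 + \tfrac16)^{12}$ without rounding, so the two values of $\chi_s^2$ differ slightly. The function names are introduced here.

```python
from math import comb

THROWS = 4096
observed = [447, 1145, 1181, 796, 380, 115, 24, 8]     # 0, 1, ..., 6, "7 and over"
tabulated = [459, 1103, 1213, 809, 364, 116, 27, 5]    # rounded expectations, Table 3

def expected_frequencies(dice=12, p=1 / 6, throws=THROWS, last=7):
    """Binomial expectations, with all counts >= last pooled into one class."""
    terms = [throws * comb(dice, r) * p ** r * (1 - p) ** (dice - r)
             for r in range(dice + 1)]
    return terms[:last] + [sum(terms[last:])]

def chi_square(obs, exp):
    """Sum of (observed - expected)^2 / expected over the classes, as in (5.80)."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

exact = expected_frequencies()
chi_table = chi_square(observed, tabulated)    # uses the printed expectations
chi_exact = chi_square(observed, exact)        # uses unrounded expectations
```

With the printed expectations `chi_table` reproduces the total of the last column of Table 3.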

Example 2. In § 5.9 we fitted a normal curve to the data of Table 2 on weights of
Glasgow school children. We will now test the goodness of fit.

We first calculate standard $t$ values corresponding to each value of $X_e$, and find from a
table of the normal law the respective values of $\Phi(t)$. Differences of successive entries give
the area under the standard normal curve for each class interval, and these areas are
multiplied by 1000 to give the calculated frequencies $f_c$, with results shown in Table 4. The
column headed $f_o$ gives the observed frequencies previously denoted by $m_i$.

Table 4. Weights of 1000 Glasgow School Children

$X_e$ (lb)     $t$       $\Phi(t)$   $\Delta\Phi(t)$    $f_c$     $f_o$    $f_o - f_c$    $(f_o - f_c)^2/f_c$

  31.5                 .0025     .0025       2.5       1      -1.5 }
  35.5                 .0172     .0147      14.7      14      -0.7 }        0.3
  39.5     -1.422      .0775     .0603      60.3      56      -4.3          0.3
  43.5                 .2330     .1555     155.5     172      16.5          1.8
  47.5                 .4852     .2522     252.2     245      -7.2          0.2
  51.5                 .7441     .2589     258.9     263       4.1          0.1
  55.5      1.349      .9113     .1672     167.2     156     -11.2          0.8
  59.5                 .9794     .0681      68.1      67      -1.1          0.0
  63.5      2.734      .9969     .0175      17.5      23       5.5 }
   ∞          ∞       1.0000     .0031       3.1       3      -0.1 }        1.4

                                                                 Total      4.9


From the last column we see that $\chi_s^2$ is 4.9, after pooling two classes at the tails of the
distribution. The number of classes is thereby reduced to 8, and the number of degrees
of freedom to 5, since two parameters, estimated from the sample, have been used in
calculating $t$. The value of $P$ is 0.44, so that there is no evidence, as far as this test goes, that
the weights of Glasgow children are not normally distributed.
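The steps of Example 2 can be sketched with the standard library, taking the fitted mean 47.712 lb and standard deviation 5.774 lb from § 5.9 and the class ends and observed frequencies from Table 4. Because the distribution function is evaluated here without rounding $t$ to three decimals, the calculated frequencies and $\chi_s^2$ differ slightly from the tabulated ones. The helper `pool` is a name introduced for this illustration.

```python
from statistics import NormalDist

fitted = NormalDist(mu=47.712, sigma=5.774)
upper_ends = [31.5, 35.5, 39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]   # X_e, lb
observed = [1, 14, 56, 172, 245, 263, 156, 67, 23, 3]   # f_o; last class is open-ended
TOTAL = 1000

# Distribution function at each class end; the open last class runs to infinity.
cdf = [fitted.cdf(x) for x in upper_ends] + [1.0]
calculated = [TOTAL * (hi - lo) for lo, hi in zip([0.0] + cdf[:-1], cdf)]

def pool(freqs):
    """Merge the two classes at each tail, as in Example 2."""
    return [freqs[0] + freqs[1]] + freqs[2:-2] + [freqs[-2] + freqs[-1]]

f_o, f_c = pool(observed), pool(calculated)
chi_sq = sum((o - c) ** 2 / c for o, c in zip(f_o, f_c))
dof = len(f_o) - 1 - 2        # two parameters were estimated from the sample
```

The pooled table has 8 classes and 5 degrees of freedom, in agreement with the text.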
Example 3. The data in Table 5 refer to reaction-time differences to a light stimulus,
with and without warning, in units of 1 millisecond. ($X$ = reaction time without
warning minus reaction time with warning.)

Table 5. Reaction-time Differences to Light Stimulus

 $X_e$      $t$       $F(t)$     $\Delta F(t)$    $f_c$    $f_o$   $f_o - f_c$   (pooled)   $(f_o - f_c)^2/f_c$

 -175    -2.754    .00027    .00027     0.08     1      0.92 }
 -125    -2.391    .00208    .00181     0.56     0     -0.56 }
  -75    -2.028    .01005    .00797     2.48     1     -1.48 }    -1.50        0.21
  -25    -1.665    .03377    .02372     7.38     7     -0.38 }
   25    -1.302    .08570    .05193    16.15    16     -0.15                   0.00
   75    -0.938    .17458    .08888    27.64    29      1.36                    .07
  125    -0.575    .29762    .12304    38.26    40      1.74                    .08
  175    -0.212    .44113    .14351    44.63    47      2.37                    .13
  225     0.151    .58553    .14440    44.91    42     -2.91                    .19
  275     0.514    .71347    .12794    39.79    41      1.21                    .04
  325     0.878    .81517    .10170    31.63    26     -5.63                   1.00
  375     1.241    .88810    .07293    22.68    25      2.32                    .24
  425     1.604    .93623    .04813    14.97    16      1.03                    .07
  475     1.967    .96504    .02881     8.96     8     -0.96                    .10
  525     2.331    .98249    .01745     5.42     5     -0.42                    .03
  575     2.694    .99148    .00899     2.80     5      2.20 }
  625     3.057    .99604    .00456     1.42     2      0.58 }     1.55         .44
   ∞        ∞     1.0000     .00396     1.23     0     -1.23 }

 Totals                                         311                            2.60


The first four moments of the distribution of observed frequencies (with Sheppard's
corrections, although the total frequency is rather small for these) give

    $k_1 = 204.2$ msec
    $k_2 = 18{,}951$ msec²
    $g_1 = 0.372$
    $g_2 = 0.009$

Hence the estimated parameters of the parent population are

    $\mu = 204.2$ msec
    $\sigma = 137.7$ msec
    $\gamma_1 = 0.372$
    $\gamma_2 = 0.009$
    $\beta_1 = 0.139$
    $\beta_2 = 3.009$

The standard errors of $g_1$ and $g_2$ are roughly 0.15 and 0.36. (See Problem 19, Chapter
VII.) For a Pearson Type III curve, we should have $2\gamma_2 - 3\gamma_1^2 = 0$, and this is sufficiently
nearly fulfilled. On the Pearson diagram of Types the point corresponding to $(\beta_1, \beta_2)$ lies
near the beginning of the line corresponding to Type III.

The skewness may be taken as 0.4 approximately, and the values of $F(t)$ in Table 5 are
read from Salvosa's tables for $\gamma_1 = 0.4$. As before, $f_c = 311\, \Delta F(t)$. The value of $\chi_s^2$ turns
out to be 2.60 after pooling a few classes at the ends of the distribution. The number of
degrees of freedom is $13 - 4 = 9$, since 3 parameters were estimated from the sample. The
value of $P$ is about 0.98, so that the fit is very good, almost suspiciously good.


5.15 The Log-normal Distribution. It sometimes happens that if a variable 
X is markedly skew in its distribution, log X is approximately normal. This 
is so, for instance, if X is affected by many random causes, each of which 
produces a small effect proportional to X itself. For, if AX = cX, A log X = 
(1/X) AX = c, and the resultant of many random causes, each producing a 
small constant effect, is a normal distribution. 

A rough test of whether a distribution is log-normal may be made by plotting 
the distribution on logarithmic probability paper or by plotting log X against 
the probit corresponding to X. Either natural or common logarithms may 
be used. In either case the curve should be nearly straight. 

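Let $Y = \log_e X$ and suppose the distribution of $Y$ is normal. The frequency
function for $Y$ is $p(Y) = \dfrac{1}{S\sqrt{2\pi}}\, e^{-(Y-M)^2/2S^2}$, where $M$ and $S$ are the mean
and standard deviation of the $Y$ distribution. If $m$ and $s$ are the mean and
standard deviation of the $X$ distribution,*

(5.103)    $m = \int_0^\infty p(X)\, X\, dX$

    $= \int_{-\infty}^{\infty} p(Y)\, e^{Y}\, dY$

    $= \dfrac{1}{\sqrt{2\pi}\, S} \int_{-\infty}^{\infty} e^{Y - (Y-M)^2/2S^2}\, dY$

    $= \dfrac{1}{\sqrt{2\pi}\, S}\, e^{M + S^2/2} \int_{-\infty}^{\infty} e^{-(Y - M - S^2)^2/2S^2}\, dY$

whence

    $m = e^{M + S^2/2}$

or

(5.104)    $\log m = M + \tfrac12 S^2$

* $m$ and $s$, as well as $M$ and $S$, are here population parameters, not sample statistics.
See § 6.1.

If we use common logarithms, $M$ and $S$ are replaced by $M_1/c$ and $S_1/c$
respectively, where $M_1$ and $S_1$ refer to the distribution of $\log_{10} X$ and $c = 0.4343 \cdots$,
so that

(5.105)    $\log_{10} m = c \log m = M_1 + \dfrac{S_1^2}{2c}$

Again,

    $s^2 + m^2 = \int_0^\infty p(X)\, X^2\, dX = \int_{-\infty}^{\infty} p(Y)\, e^{2Y}\, dY = e^{2M + 2S^2}$

Therefore

    $1 + \dfrac{s^2}{m^2} = e^{S^2}$

so that

(5.106)    $S^2 = \log\left(1 + \dfrac{s^2}{m^2}\right)$

If common logarithms are used,

    $S_1^2 = c \log_{10}\left(1 + \dfrac{s^2}{m^2}\right)$

Hence, from (5.104),

(5.107)    $M = \log\left[m \left(1 + \dfrac{s^2}{m^2}\right)^{-1/2}\right]$  or  $M_1 = \log_{10}\left[m \left(1 + \dfrac{s^2}{m^2}\right)^{-1/2}\right]$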
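The relations between the parameters of $X$ and those of $Y = \log_e X$ are easy to exercise numerically. The following sketch (natural logarithms only; function names and the illustrative values $M = 1$, $S = 0.5$ are assumptions made here) converts in both directions and checks that the round trip recovers the starting values.

```python
import math

def lognormal_from_log_params(M, S):
    """Mean m and standard deviation s of X when Y = ln X is normal (M, S)."""
    m = math.exp(M + 0.5 * S * S)                 # from log m = M + S^2 / 2
    s = m * math.sqrt(math.exp(S * S) - 1.0)      # from 1 + s^2/m^2 = exp(S^2)
    return m, s

def log_params_from_lognormal(m, s):
    """Invert the relations, giving M and S from m and s."""
    ratio = 1.0 + (s / m) ** 2
    S = math.sqrt(math.log(ratio))                # S^2 = log(1 + s^2/m^2)
    M = math.log(m / math.sqrt(ratio))            # M = log[m (1 + s^2/m^2)^(-1/2)]
    return M, S

m, s = lognormal_from_log_params(1.0, 0.5)        # illustrative values of M and S
M_back, S_back = log_params_from_lognormal(m, s)
```

For common logarithms the same functions can be used after dividing $M_1$ and $S_1$ by $c = 0.4343\cdots$.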

In an article in Nature, October 20, 1945, J. H. Gaddum has given many
examples of distributions which are approximately log-normal. These include
size of silver particles in photographic emulsion, survival time of bacteria in
disinfectants, number of plankton organisms caught in a net, weight and blood
pressure of human beings, and number of words in a sentence written by G. B.
Shaw.

A modified logarithmic transformation $Y = \log (X + X_0)$ may be useful if
the curve of probits against $\log X$ shows a more or less constant curvature.
If the curve is convex upward, $X_0$ is negative; if concave upward, $X_0$ is
positive. A rough estimate of $X_0$ may be made from the graph. If $X_1$, $X_2$, $X_3$
are values of $X$ corresponding to three equidistant points on the probit scale,
say 4, 5, and 6, then $X_2 + X_0$ is the geometric mean of $X_1 + X_0$ and $X_3 + X_0$,
so that

    $X_0 = \dfrac{X_2^2 - X_1 X_3}{X_1 + X_3 - 2X_2}$

The value so obtained can be tested by plotting log (X + Xo) against the 
probits and adjusting if necessary. It is claimed that various empirical 
distributions can be fitted better in this way than by any curves of the 
Pearson system. 
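The rough estimate of $X_0$ is a one-line calculation. In the sketch below the function name is hypothetical and the three readings (2, 6, 14) are invented so that $X + X_0$ runs 4, 8, 16, that is, in geometric progression when $X_0 = 2$.

```python
def shift_estimate(x1, x2, x3):
    """X_0 from three X values read at equidistant probits (say 4, 5, 6)."""
    denominator = x1 + x3 - 2.0 * x2
    if denominator == 0:
        # Equal spacing in X itself: no finite shift satisfies the relation.
        raise ValueError("points are equally spaced in X; no finite X_0")
    return (x2 * x2 - x1 * x3) / denominator

x0 = shift_estimate(2.0, 6.0, 14.0)
```

The value obtained satisfies $(X_2 + X_0)^2 = (X_1 + X_0)(X_3 + X_0)$, which is the geometric-mean condition stated above.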

Problems 


1. The naturalist Buffon (1708-1788) tossed a coin 4040 times and obtained 2048 heads
and 1992 tails. Use the table of $\chi^2$ to test whether this result is reasonable, on the hypothesis
that the true probability of a head with the coin Buffon used was $\tfrac12$. Ans. The probability
of a discrepancy at least as great is 0.366.

2. In one of his experiments with peas, the Abbé Mendel observed the shape and color
of peas from a number of plants, and found the following distribution:

    Round, yellow      315
    Round, green       108
    Angular, yellow    101
    Angular, green      32

According to Mendel's theory of heredity, the expected numbers should be in the ratio
9 : 3 : 3 : 1. Calculate these expected numbers, and determine by the $\chi^2$ test whether the
agreement with theory can be regarded as satisfactory.
3. Toss seven coins 128 times and record the frequencies of heads. Apply the $\chi^2$ test
to the resulting distribution.

4. 1000 discs were put in a goldfish bowl, each disc bearing a number ranging from 0 to
24. A single disc was drawn 1000 times, the number being noted and the disc put back and
mixed with the others after each drawing. The actual frequencies $f_c$ of the discs in the bowl
and the observed frequencies $f_o$ in 1000 drawings are shown in the table. Are the results
consistent with the hypothesis that the method of drawing was really random?

    X       0    1    2    3    4    5    6    7    8    9   10   11   12
    f_o     0   27   57   87  105  109   94   89   78   78   46   65   35
    f_c     1   23   61   92  106  100   94   87   78   69   59   49   40

    X      13   14   15   16   17   18   19   20   21   22   23   24
    f_o    19   37   18   18    6   11   15    7    6    2    1    0
    f_c    32   26   20   16   13   10    8    6    4    3    2    1


5. In 500 throws with a die, 6 has turned up 98 times. Is this number large enough to
cast suspicion on the die?

Hint. Prove that the probability of 98 or more sixes with a good die is about 0.04. This
particular event is, therefore, rare enough to appear significant. The probability of a
deviation from the expected number at least as great as that observed, in either direction
and for any face of the die, is, of course, much greater and is not significant. The occurrence
of this unexpectedly large number of sixes cannot, therefore, by itself, be regarded as good
evidence of bias. A sounder opinion could be formed by examining the whole distribution
(if available) of observed numbers of spots for the 500 throws of this die, and using the $\chi^2$
test.

6. Suppose that the whole distribution referred to in the hint following Problem 5 is
that given below, where $X$ is the number of spots on the upper face of the die.

    X     1    2    3    4    5    6
    f    71   78   85   82   86   98

What opinion would you form of the accuracy of the die? Ans. The probability of a value
of $\chi^2$ at least as great as that observed is about 0.42. The observed result is, therefore, quite
reasonable with a good die.

7. In a study of plant disease (spotted wilt of tomatoes), the numbers of diseased plants
were counted in each of 160 groups of plants. Each group contained 9 plants evenly spaced,
so that the number $X$ of diseased plants could take integral values from 0 to 9. The
distribution was as follows:

    X     0    1    2    3    4    5    6    7    8    9
    f    36   48   38   23   10    3    1    1    0    0

Assuming that the probability of being diseased, $p$, is constant, fit this distribution with the
binomial $160(q + p)^9$, estimating $p$ from the sample. Test the agreement by the $\chi^2$ test by
(a) pooling the last four frequencies, (b) pooling the last five, (c) pooling the last six, and
noting in each case the value of $\Pr\{\chi^2 > \chi_s^2\}$. Ans. (a) .004, (b) .037, (c) .047.
The usual pooling procedure leads here to an over-estimate of the probability. It also
disguises the fact that the differences between observed and calculated frequencies
corresponding to $X = 4, 5, 6, 7$ are all of the same sign. The fit is really poorer than procedure
(c) would suggest.

8. Prove that if $t = (\chi^2 - k + 1)/(2k - 2)^{1/2}$, the distribution of $t$ is Pearson Type III
with skewness $2[2/(k - 1)]^{1/2}$. Hence Salvosa's tables of the Type III distribution may be
used to calculate $\Pr\{\chi^2 > \chi_s^2\}$. Check by taking one or two values of $\chi^2$ and $k$, calculating
the corresponding $t$, and entering Salvosa's tables.

9. In a certain experiment involving counting yeast cells with a haemacytometer, the
count in each of 400 squares varied between 1 and 12, with the distribution of frequencies
shown in the attached table, where $X$ represents the cell count.

    X       1    2    3    4    5    6    7    8    9   10   11   12
    f_o    20   43   53   86   70   54   37   18   10    5    2    2

Show that this distribution is fitted very well with a Poisson curve. Note that in the
parent population $X$ may take any integral value from 0 to $\infty$. All theoretical frequencies,
from $X = 11$ on, may be pooled, however.

10. It has been observed by some hydrographic engineers that the distribution of the
maximum 24-hour run-off during a year, over a long series of years, follows for many rivers a
Pearson Type III distribution, with a skewness of approximately 0.6. Test this hypothesis
on the following data for the Merrimack River, where $X$ is in thousand cu ft/sec and the
class intervals run 20 to 30, 30 to 40, etc.

    X     20-   30-   40-   50-   60-   70-   80-   90-
    f     12    24    17    12     9     5     2     1


11. The following observations on maximum 24-hour run-off for the North Saskatchewan
River form too short a series for the skewness to be estimated with much reliability.
Assuming, however, that the skewness is really 0.6 (see Problem 10), test approximately whether the
distribution is Pearson Type III by forming a cumulative frequency distribution in the way
described in section 5.10 for ungrouped data, plotting on probability paper, and comparing
with the theoretical Type III curve on the same sheet. It may be noted that if we use
logarithmic probability paper, or if $\log X$ is plotted against the probit, the curves are much
more nearly straight than if arithmetic probability paper is used. In the table $X$ has been
expressed as a percentage of the mean, so that the estimated expectation of $X$ is 100.

    Year          X         Year          X
    1911-12     123.4       1920-21      75.3
    1912-13     102.5       1921-22      73.2
    1913-14      83.8       1922-23     105.0
    1914-15     138.8       1923-24      85.6
    1915-16     117.8       1924-25     113.7
    1916-17     116.8       1925-26      99.2
    1917-18      88.1       1926-27     124.0
    1918-19      67.0       1927-28     120.6
    1919-20      98.4       1928-29      66.1


12. Fit the following data, on weights of college freshmen (to the nearest pound), with
three terms of an Edgeworth series. The values of $X_e$ refer to the true ends of the class
intervals, the interval being 11 lb.

    X_e      f_o
    110.5     15
    121.5     43
    132.5    138
    143.5    162
    154.5    129
    165.5     82
    176.5     35
    187.5     16
    198.5      5
    209.5      3
    220.5      1

The first four (corrected) moments for this sample may be taken to be

    $m_1 = 142.25$ lb
    $m_2 = 309.46$ lb²
    $m_3 = 3{,}431.2$ lb³
    $m_4 = 353{,}980$ lb⁴

The $m$'s are defined in § 5.8.


13. If $N$ observations are distributed among four classes, with $a$, $b$, $c$, $d$ in the respective
classes ($a + b + c + d = N$) and if the probability that an observation falls in the $i$th class
is $p_i$ ($i = 1, 2, 3, 4$), show that (a) the expected value of a linear function of the observed
frequencies, $x = k_1 a + k_2 b + k_3 c + k_4 d$, is given by $E(x) = N \sum k_i p_i$, (b) the variance of
$x$ is given by $\mathrm{Var}(x) = N\left[\sum k_i^2 p_i - \left(\sum k_i p_i\right)^2\right]$, (c) the covariance of two linear functions
$x$ and $y$, where $y = h_1 a + h_2 b + h_3 c + h_4 d$, is given by

    $\mathrm{Cov}(x, y) = N\left[\sum h_i k_i p_i - \left(\sum h_i p_i\right)\left(\sum k_i p_i\right)\right]$

Hint. The joint probability of $a$, $b$, $c$, $d$ is the multinomial term

    $\dfrac{N!}{a!\, b!\, c!\, d!}\, p_1^a p_2^b p_3^c p_4^d$

References 

1. C. E. Weatherburn, A First Course in Mathematical Statistics, Chap. VIII. 

2. E. B. Wilson & M. M. Hilferty, Proc. Nat. Acad. Sci., 17, 1931, p. 684. The approxi- 
mation to the exact values calculated by C. M. Thompson (Biometrika, 32, 1941, p. 187) is 
remarkably good. 

3. W. G. Cochran, "The Distribution of Quadratic Forms in a Normal System," Proc. 
Camb. Phil. Soc., 30, 1933-4, p. 178. See also M. G. Kendall, Advanced Theory of Statistics, 
Vol. II, p. 177. 

4. W. P. Elderton, Frequency Curves and Correlation, 3rd Edition, Chap. IV. This 
book is a compendium of information on the Pearson types. 

5. C. C. Craig, "A New Exposition and Chart for the Pearson System of Frequency 
Curves," Ann. Math. Stat., 7, 1936, p. 16. 

6. See H. Cramér, Mathematical Methods of Statistics, pp. 223-224, for a discussion with 
references. 

7. R. A. Fisher, Statistical Methods for Research Workers, 10th Edition, p. 80. 

8. H. Cramér, loc. cit., p. 420. 

9. W. G. Cochran, "The Chi-square Distribution for the Binomial and Poisson Series, 
with Small Expectations," Ann. of Eugenics, 7, 1936, p. 207. 

10. C. I. Bliss, "A Chart of the Chi-square Distribution," Jour. Amer. Stat. Assoc., 39, 
1944, p. 246. 

11. H. Cramér, loc. cit., pp. 425-441 and p. 506. 

12. See, e.g., L. S. Hall, Trans. Amer. Soc. Civil Engineers, 84, 1921, p. 191. Also dis- 
cussion on pp. 241-257. 

13. Types IX to XII are described in Karl Pearson's "Second Supplement to a Memoir 
on Skew Variation," Phil. Trans. A, 216, 1916, pp. 429-457. This paper is reprinted in 
Pearson's Early Statistical Papers, Cambridge Univ. Press, 1948. 



CHAPTER VI 


FUNDAMENTALS OF SAMPLING THEORY WITH SPECIAL REFERENCE TO 

THE MEAN 

6.1 Introduction. In many statistical problems the data at hand are regarded 
as a random sample drawn from a parent population or universe of discourse 
and we are interested in drawing inferences about the universe from the 
sample. The phrase "random sample" implies that each individual from the 
universe has an equal and independent chance to be included in the sample. 
From such samples we attempt to draw inferences concerning the universe. 
In order to deal with this inductive argument we first consider a deductive 
argument; that is, we first consider an infinite (or finite) universe and investi- 
gate the behavior of samples according to the laws of probability. The 
methodology dealing with this class of problems is known as sampling theory. 
The center of interest in sampling theory is the development of criteria for 
assisting common sense or educated judgment concerning the magnitude of 
chance fluctuations in statistical ratios, averages, and coefficients. 

The Bernoulli theory deals with sampling fluctuations in relative frequencies. 
In the words of Rietz/ 

But it is fairly obvious that the interest of the statistician in the effects of sampling 
fluctuations extends far beyond the fluctuations in relative frequencies. To illustrate, 
suppose we calculate any statistical measure such as an arithmetic mean, median, 
standard deviation, correlation coefficient, or parameter of a frequency function from 
the actual frequencies given by a sample of data. If we need then either to form a 
judgement as to the stability of such results from sample to sample or to use the results 
in drawing inferences about the sampled population, the common-sense process of 
induction involved is much aided by a knowledge of the general order of magnitude 
of the sampling discrepancies which may reasonably be expected because of the limited 
size of the sample from which we have calculated our statistical measures. 

A statistical measure calculated from the actual frequencies given by a 
sample has been called a statistic by R. A. Fisher.^ This is to avoid a verbal 
confusion with the corresponding parameter in the universe which we should 
like to know but can generally only estimate. It is a matter of common 
experience that a statistic will vary from sample to sample. To characterize 
the variation that may be tolerated on the basis of chance is one of the 
fundamental problems of sampling theory. 

In discussing such sampling fluctuations, Fisher ® introduces the subject 
as follows: 

The idea of an infinite population distributed in a frequency distribution in respect 
of one or more characters is fundamental to all statistical work. From a limited 
experience, for example, of individuals of a species, or of the weather of a locality, we 
may obtain some idea of the infinite hypothetical population from which our sample 
is drawn, and so of the probable nature of future samples to which our conclusions 
are to be applied. If a second sample belies this expectation we infer that it is, in the 
language of statistics, drawn from a different population; that the treatment to which 
the second sample of organisms had been exposed did in fact make a material differ- 
ence, or that the climate (or methods of measuring it) had materially altered. Critical 
tests of this kind may be called tests of significance, and when such tests are available 
we may discover whether a second sample is or is not significantly different from the 
first. 

6.2 Method of Attack. The whole theory of sampling is based on frequency 
distributions and probability. In order to explain the tests of significance 
that have been developed, it is desirable to outline briefly the philosophy 
underlying the method of attack. 

Sampling theory deals with specific questions like the following: Given the 
mean and standard deviation of a sample of N variates, how reliable are these 
as estimates of the population mean and standard deviation, respectively? 
Given two samples, do their respective means or other statistics differ signifi- 
cantly? Can the differences be accounted for on the basis of chance or do 
the samples come from different populations? The answers require in general 
that we conceive the universe as one distribution, the values of the statistic 
calculated from all possible samples of size N from that universe as another 
distribution, and that there are mathematical expressions capable of represent- 
ing both distributions. This is the chief reason for studying frequency curves 
and probability distributions. 

Suppose, for example, that we have computed a statistic — say the mean 
of 100 observations or measurements. What we get is not an absolutely fixed 
quantity which may be exactly reproduced again by taking 100 similar 
measurements. Indeed, if such an experiment were repeated many times 
we would get values for the arithmetic mean which would form a frequency 
distribution. This distribution would have its own mean (mean of means), 
standard deviation, and higher moments. The law describing the frequency 
distribution of all possible means of samples of size N from a specified universe 
is called a frequency function when it can be expressed mathematically. 
What has been said of the mean holds for any other statistic. 
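
The repeated experiment described above can be imitated numerically. The following is only a minimal sketch under stated assumptions: the universe (uniform on the interval 0 to 1), the number of repetitions, and the helper name `sample_means` are illustrative choices and do not come from the text.

```python
import random
import statistics


def sample_means(draw, n, repetitions, seed=0):
    """Return `repetitions` means, each computed from `n` values produced by `draw`."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repetitions)]


# Hypothetical universe: uniform on (0, 1), with mean 1/2 and variance 1/12.
means = sample_means(lambda rng: rng.random(), n=100, repetitions=2000)

# The means themselves form a frequency distribution with its own moments.
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theoretical_sd = (1 / 12 / 100) ** 0.5  # sigma / sqrt(N) for this universe

print(f"mean of means      = {mean_of_means:.4f}")
print(f"s.d. of the means  = {sd_of_means:.4f}")
print(f"sigma / sqrt(N)    = {theoretical_sd:.4f}")
```

The list `means` plays the part of the frequency distribution of the statistic; any other statistic could be substituted for the mean in the same way.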

Formulation of statistical judgment about a sample involves the specification 
of the universe and the determination of the frequency function of a given 
statistic in samples of a given size drawn from this universe. The problem of 
determining the frequency functions for the various statistics from specified 
universes is one which has challenged modern mathematical research. In 
most cases it has been necessary to assume that the parent universe is of the 
normal form in order to obtain analytically the sampling distribution of the 
statistic. Many of the tests of significance are based upon this assumption. 
However, considerable information about sampling distributions from arbitrary 
universes is known in terms of their moments. 

6.3 Point Estimation and Interval Estimation. Problems of estimating the 
actual values of population parameters are said to belong to "point estimation." 
It is equally important to know within what limits we can confidently 
expect the parameter values to lie, with any specified degree of confidence, and 
such problems belong to "interval estimation." They can be solved if we 
know the distribution of the statistic $t$ used for estimating the parameter $\theta$; 
the distribution function will, in general, depend on $\theta$ as well as possibly on 
other parameters. 

It was formerly the custom to attach to the definitions of common statistics, 
such as the arithmetic mean or coefficient of correlation, a formula giving the 
so-called probable error of the statistic. The following concise exposition of 
the various usages of the term "probable error" is due to Professor A. T. Craig. 

There are in the literature three conceptions of the probable error. If, 
purely for convenience of language, we refer to the probable error of the mean, 
these conceptions can be stated as follows: (i) The probable error of the mean 
is that deviation, extended on both sides of the mean of the population, such 
that $\tfrac{1}{2}$ is the probability that the mean of a sample will fall in this interval; 
(ii) The probable error of a mean is that deviation, extended on both sides of 
the mean of a sample, such that $\tfrac{1}{2}$ is the probability that the mean of the 
population lies in this interval; (iii) The probable error of the mean is that 
deviation, extended on both sides of the mean of a sample, such that $\tfrac{1}{2}$ is the 
probability that the mean of another sample will fall in this interval. Conception 
(i) leads without difficulty to the usual formula $0.6745(\sigma/\sqrt{N})$ for the probable 
error of the mean. This formula is rigorously correct for samples of any size 
drawn from a normal population and is valid for large samples drawn from any 
population with finite variance. On the other hand, the formula cannot be 
established under conception (ii) without further assumptions. If, before 
the sample is drawn, it is assumed, in the absence of any knowledge concerning 
the distribution of possible values of the mean of the population, that the 
probability is constant, then the formula admits mathematical proof. But 
this assumption is essentially the same assumption as that made in applying 
Bayes' Theorem to problems of probability a posteriori. 

The modern method of expressing the reliability of a statistical estimate of a 
population parameter in terms of confidence intervals seems likely to replace 
the traditional but often misleading mode of expression involving probable 
error. Conception (i) is seldom useful in practice, as the true population 
parameters are usually unknown. Conception (ii) seems to depend on a rather 
artificial probability scheme. If a sample is drawn from a fixed population, 
and an interval is calculated centering on the sample mean, the true population 
mean either does or does not lie in this interval. The probability that it 
does so is either 0 or 1. The population mean is not, in fact, a random variable 
at all. It is the sample mean that is the random variable. If we take many 
samples from the same population we can, however, calculate for each an inter- 
val, which will vary from sample to sample, but which will be such that if, 
on each occasion, we assert that the population mean lies inside it, we shall be 
right in a definite percentage, say 90%, of such occasions. An interval thus 
calculated is, in the language of J. Neyman and E. S. Pearson,^ called a 90% 
confidence interval. It has a definite probability, 0.9, of covering the true 
value, whatever the true value may be. 
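
The repeated-sampling meaning of a 90% confidence interval can be checked by simulation. The sketch below rests on assumptions that are not in the text: a normal parent population with known standard deviation, illustrative values of the mean, standard deviation and sample size, and a hypothetical helper named `coverage`.

```python
import random
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, trials, level=0.90, seed=1):
    """Fraction of samples whose interval m +/- z*sigma/sqrt(n) covers the fixed mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal deviate for the chosen level
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        m = fmean(rng.gauss(mu, sigma) for _ in range(n))  # the random quantity
        if m - half_width <= mu <= m + half_width:         # mu itself never changes
            hits += 1
    return hits / trials


# Illustrative (assumed) population values.
print(coverage(mu=70.0, sigma=3.0, n=25, trials=4000))
```

The population mean stays fixed throughout; only the interval moves from sample to sample, and the proportion of intervals that cover the mean should settle near 0.9.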

On the other hand, conception (ii) depends on regarding the parent population 
not as fixed but as itself a random sample from a hypothetical family of 
parent populations with all possible values of the parameter $\theta$. The a 
posteriori probability of $\theta$, given the sample value $t$ of the statistic used to 
estimate $\theta$, can then be calculated by Bayes' Theorem, provided some assumption 
is made about the a priori distribution of $\theta$. This assumption is usually 
the one made by Bayes himself, and strongly attacked by Fisher (see § 1.10), 
that before the sample is drawn all values of $\theta$ are equally likely. 

The invalidity of this assumption in many applied problems of statistical 
interest may be seen clearly in cases of a continuous frequency function with 
a derivative. Suppose that our initial assumptions relating to a parameter $\theta$ 
were such that $\theta$ would initially be distributed in accord with a continuous 
frequency function, $g(\theta)$, which has a derivative at each point within its possible 
range, say from $\theta = \alpha$ to $\theta = \beta$. Next, suppose $g(\theta)$ were restricted 
to be constant throughout the range of $\theta$. Then it is well known that the 
distribution of a simple non-linear function of $\theta$ would not be constant. For 
example, the distribution of $z = \theta^n$ ($n > 1$, $\theta$ real and non-negative) would not 
be constant, but would be distributed in accord with a frequency function 
$(1/n)z^{(1-n)/n}$. But if $\theta$ is a population parameter, it seems fairly obvious that 
the logical character of our theory should usually, if not always, be such as 
to enable us to use a power of $\theta$ as a parameter if we found it convenient to 
do so. 
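
A small numerical illustration of this point, under assumptions chosen only for convenience (a parameter uniform on the interval 0 to 1, the power $n = 2$, and a hypothetical helper `fraction_below`):

```python
import random


def fraction_below(values, cut):
    """Proportion of the values that do not exceed `cut`."""
    return sum(v <= cut for v in values) / len(values)


rng = random.Random(2)
n = 2                                          # assumed power, n > 1
theta = [rng.random() for _ in range(20000)]   # theta "equally likely" on (0, 1)
z = [t ** n for t in theta]                    # the same ignorance expressed in z = theta**n

# For uniform theta about one quarter of the values lie below 0.25.
# For z the density is (1/n) * z**((1 - n)/n), so P(z <= 0.25) = P(theta <= 0.5) = 0.5.
print("fraction of theta below 0.25:", fraction_below(theta, 0.25))
print("fraction of z     below 0.25:", fraction_below(z, 0.25))
```

A constant frequency function for the parameter therefore does not carry over to a power of the parameter, which is the difficulty raised in the paragraph above.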

The preceding introduction is designed to lead up to the important fact that, 
although in the usual statistical inquiry by sample, the true value of the population 
parameter $\theta$ is unknown and remains unknown, there are cases in which 
precise statements can be made in terms of probabilities about the bounds 
within which a parameter $\theta$ lies without making an assumption about the initial 
distribution of the possible values of $\theta$. 

6.4 Confidence Belts and Limits. For simplicity, consider a case of a single 
parameter, $\theta$, in which we know the frequency function of the statistic, $t$, to be 
given by an integrable function 

(6.1)    $y_t = f(t, \theta)$ 

where the values of $t$ obtained from observation may be assumed to be reasonable 
estimates of $\theta$. Suppose we know (6.1) in such form that it is possible 
to calculate a table of values of the probabilities that the statistic, $t$, will fall 
into an assigned interval selected on a possible range $(a, b)$ for any assigned 
value of $\theta$ within the possible range $(\alpha, \beta)$ of $\theta$. 

Next, for illustration, select a positive number $\epsilon$, say $\epsilon = .005$, on which to 
base a certain level of confidence about values of $\theta$ to be expressed in terms of 
probabilities. 

As our main problem may be clarified by a geometrical representation, 
conceive of corresponding values of $t$ and $\theta$ obtained in an extensive statistical 
experiment as represented by rectangular coordinates within the rectangle bounded 
by lines $t = a$, $t = b$, $\theta = \alpha$, $\theta = \beta$ (Fig. 16). 

Consider an arbitrary assignment for $\theta$, say that $\theta'$ is the true value of $\theta$. 
This gives the line AB (Fig. 16). Since the distribution of the statistic $t$ is 
assumed to be known for each assigned value of $\theta$, we may locate on the line 
AB two points, $t_1$ and $t_2$ ($t_1 < t_2$), such that $\epsilon$ is equal to the probability that 
a random sample will yield a value of $t$ less than or equal to $t_1$, and similarly 
$\epsilon$ is the probability that such a sample will yield a value greater than or equal 
to $t_2$. Then we have an interval on AB from $t_1$ to $t_2$ such that $1 - 2\epsilon$ is the 
probability that the random sample will yield a value within this interval. 

More formally stated, we may introduce a function $F(t, \theta)$ defined as the 
definite integral of $f(t, \theta)$ in (6.1) from $t = a$ to $t$. That is, 

$F(t, \theta) = \int_a^t f(t, \theta)\, dt$ 

for any arbitrarily assigned real value of $\theta$ on its range from $\alpha$ to $\beta$. Then 

$F(a, \theta) = 0, \quad F(b, \theta) = 1, \quad F(t_1, \theta') = \epsilon, \quad F(t_2, \theta') = 1 - \epsilon, \qquad (0 < \epsilon < \tfrac{1}{2})$ 

By considering all possible assignments of $\theta$, in its possible range $(\alpha, \beta)$, the 
locus of our set of lower values of $t$, illustrated by $t_1$ on the line AB, will give a 
continuous curve which we mark with $C_\epsilon$ in Figure 16, the subscript $\epsilon$ being 
used to remind us that $\epsilon$ is the probability that a random value of $t$ for $\theta = \theta'$ 
will fall below or at $t_1$. Similarly, our set of upper values of $t$, illustrated by 
$t_2$ on AB, give a curve which we mark with $C_{1-\epsilon}$. 

If $t$ is a good estimate of $\theta$, its value usually, if not always, increases with $\theta$ 
for all possible values. Thus, we shall restrict our further considerations to 
cases in which we may assume that $t$ increases as $\theta$ increases and vice versa. 
More precisely we are concerned with one-valued monotone increasing functions 
represented by the two curves marked $C_\epsilon$ and $C_{1-\epsilon}$. The region bounded 
by these two curves and the lines $\theta = \alpha$ and $\theta = \beta$ has been called by Neyman 
the confidence belt with confidence coefficient equal to $1 - 2\epsilon$. 

Now suppose that for one particular random sample, of the size assumed in 
setting up the confidence belt, we obtain the value $t = t_0$. Regardless of the 
actual (but unknown) value of $\theta$, the probability that the point $(t_0, \theta)$ will fall 
within the confidence belt is $1 - 2\epsilon$, as is clear from the way in which the belt 
has been constructed. Consider the line $t = t_0$ in relation to the belt. This 
line (parallel to the $\theta$-axis) will, as a rule, intersect the belt in two points, 
$(t_0, \theta_1)$ and $(t_0, \theta_2)$, $\theta_1 < \theta_2$. The probability that it will fail to do so is less 
than $2\epsilon$. The point $(t_0, \theta)$ is outside the belt for any $\theta < \theta_1$ and for any 
$\theta > \theta_2$. In fact, $\theta_1$ is the lower bound of values of $\theta$ for which the probability 
that $t > t_0$ is at least equal to $\epsilon$, and similarly $\theta_2$ is the upper bound of values 
of $\theta$ for which the probability that $t < t_0$ is at least $\epsilon$. These boundary values 
of $\theta$ are called the confidence limits of $\theta$ corresponding to $t = t_0$, and the interval 
$\theta_1$ to $\theta_2$ is called the confidence interval for $t = t_0$, with the confidence coefficient 
$1 - 2\epsilon$. 

There is no guarantee that for one particular random sample we shall be 
right in claiming that the true value of $\theta$ lies within the confidence interval, 
but the probability of our being right (in the sense of the relative frequency of 
true statements in repeated sampling) is equal to $1 - 2\epsilon$. If $\epsilon$ is small, we 
shall be right nearly all the time, but, of course, the width of the belt increases 
as $\epsilon$ diminishes. The surer we are of the statement, the vaguer is the statement 
itself. 
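
The construction of confidence limits from the distribution function can be sketched numerically. In the sketch below the statistic is taken, purely as an assumption, to be the mean of N observations from a normal population with known standard deviation, so that $F(t_0, \theta)$ is a normal distribution function. The numerical inputs are borrowed from Example 1 of § 6.7 and $\epsilon = .005$ from the illustration above; the helper names are hypothetical.

```python
from statistics import NormalDist


def prob_t_at_most(t0, theta, sigma, n):
    """F(t0, theta): probability that the sample mean is <= t0 when the true mean is theta."""
    return NormalDist(mu=theta, sigma=sigma / n ** 0.5).cdf(t0)


def solve_decreasing(g, target, lo, hi, tol=1e-10):
    """Bisection for g(theta) = target, where g decreases as theta increases on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > target:
            lo = mid   # g still too large, so theta must be larger
        else:
            hi = mid
    return (lo + hi) / 2


def confidence_limits(t0, sigma, n, eps):
    """Return (theta_1, theta_2) with P(t >= t0 | theta_1) = eps and P(t <= t0 | theta_2) = eps."""
    span = 20 * sigma / n ** 0.5   # search bracket; an assumption wide enough for this case
    F = lambda theta: prob_t_at_most(t0, theta, sigma, n)
    theta_1 = solve_decreasing(F, 1 - eps, t0 - span, t0 + span)
    theta_2 = solve_decreasing(F, eps, t0 - span, t0 + span)
    return theta_1, theta_2


t0, sigma, n, eps = 70.56, 3.115, 100, 0.005
theta_1, theta_2 = confidence_limits(t0, sigma, n, eps)
print(f"confidence coefficient {1 - 2 * eps:.2f}: {theta_1:.3f} to {theta_2:.3f}")
```

For this normal case the limits agree with $t_0 \mp z\sigma/\sqrt{N}$, where $z$ is the normal deviate cutting off a tail of $\epsilon$; the bisection is shown only because the same inversion applies when no closed form is available.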

6.5 Standard Error of the Mean. By the standard error of a statistic we 
mean an estimate of the standard deviation of the sampling distribution of 
that statistic. Since the sampling distribution usually depends on one or 
more unknown parameters which have to be estimated from the sample itself, 
these estimates must be substituted for the parameters in the formula for the 
standard error. It is desirable that such estimates should be unbiased and 
also "most efficient" in the sense of having as small a sampling variance as 
possible, at least for large samples (see § 12.3). If the sampling distribution 
is such that, when the size N of the sample tends to infinity, the distribution 
function tends to that of the normal law, it is said to be asymptotically normal. 
The use of standard errors is usually restricted to distributions which are 
asymptotically normal, since then we may, as a rule, approximate to the percentage 
points of the distribution of the statistic with those of a normal distribution 
whose standard deviation is equal to the standard error. This gives 
an intuitive approximate interpretation of the standard error. It is not, 
however, necessarily true that, because a distribution tends to a normal distribution 
with variance $\sigma^2$, the variance of the distribution tends itself to the 
value $\sigma^2$, as the following example shows.* 

Let the frequency function be 

/ = + jv^ 

where as usual $\phi(t) = (2\pi)^{-1/2} \exp(-t^2/2)$. Then it is easily verified that 
$E(t) = 0$ and $E(t^2) = 2N^{-1}$, so that the variance of this distribution is $2N^{-1}$ 
for any finite N. However, for a fixed value of $t$, the function $f$ tends, as 
$N \to \infty$, to $N^{1/2}\phi(N^{1/2}t)$, which is a normal distribution with variance $N^{-1}$. 

The variance of the limiting distribution is not therefore the same as the limit 
of the variance. The reason is that, as N increases with fixed $t$, the second 
term in $f$ tends to zero. The function $N^{1/2}\phi(N^{1/2}t)$ tends itself, however, so 
rapidly to zero for large N and for non-zero values of $t$ that the contributions 
of both terms in $f$ to the variance are of the same order of magnitude. 

If a random sample of N is taken from a population with known distribution 
function, the N values $X_1, X_2, \cdots, X_N$ may be considered as independent 
variates having the same distribution. If the mean of the sample is 

(6.2)    $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$ 

then by Theorems 4.4 and 4.9 

(6.3)    $E(\bar{X}) = \mu$ 

(6.4)    $\mathrm{Var}(\bar{X}) = \sigma^2 / N$ 

where $\mu$ and $\sigma^2$ are the expectation and variance of the known distribution of 
the $X_i$. 

By the Central Limit Theorem the distribution of $\bar{X}$ tends to normality 
as $N \to \infty$ (and it is normal for any N by Theor. 4.11 if the parent population 
is normal). Hence the standard error of $\bar{X}$ is $\sigma/\sqrt{N}$. 
If $\sigma$ is not known it must be estimated from the sample. In large samples 
the sample standard deviation $s$ may be substituted for $\sigma$. 

It is not obvious how large N should be before we can regard the distribution 
of a statistic as practically normal, even when we know that it is asymptoti- 
cally normal. For some statistics, including the arithmetic mean, a sample 
of 30 can usually be considered large; for others, such as the coefficient of 
correlation, a sample of 500 may not be sufficient to ensure a good approxima- 
tion to normality. 

It is customary to distinguish between parameters and sample statistics 
by using Greek letters for the former, and Latin letters for the latter. We 
shall adopt this useful convention. Thus $m$ ($= \bar{x}$), $m_2, \cdots$ will be used for 
sample moments corresponding to the population parameters $\mu, \mu_2, \cdots$ 

* The authors are indebted for this illustration to a critic who read the book in manuscript. 

6.6 Confidence Limits for the Mean. We know that $m$ is (or tends to be) 
normally distributed about $\mu$ with standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{N}$. If the 
distribution of $m$ be reduced to standard units by the transformation 

(6.5)    $t = \frac{m - \mu}{\sigma/\sqrt{N}}$ 

then we know that $t$ is approximately normally distributed about zero with 
standard deviation unity. Hence we can refer to a normal probability 
scale for the probability that one would obtain a random sample for which $m$ 
differs from $\mu$ by as much as $|\delta|$, where $\delta$ is expressed in the $\sigma_{\bar{x}}$ unit. So we 
have the following theorem. 

Theorem 6.1. The probability $Q_\delta$ that a random sample from an infinite 
universe will have a mean, $m$, which will be within a distance $|\delta|$ of the mean, $\mu$, 
of the universe is approximately 

$Q_\delta = 2 \int_0^{|\delta|} \phi(t)\, dt$ 

where $\delta$ is the observed value of $t$ given by (6.5) and $\phi(t)$ is the normal curve. 
Then $P_\delta = 1 - Q_\delta$ is the approximate probability that $m$ will not be within $|\delta|$ 
of $\mu$. If the universe is normal, $P_\delta$ gives the exact probability. 
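
The two probabilities in Theorem 6.1 can be evaluated directly from the normal distribution function $\Phi$, since $2\int_0^{|\delta|}\phi(t)\,dt = 2\Phi(|\delta|) - 1$. The following is a minimal sketch; the function names `q_delta` and `p_delta` are illustrative and not the authors' notation.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def q_delta(delta):
    """Probability of a deviation numerically no greater than |delta| standard errors."""
    return 2 * STANDARD_NORMAL.cdf(abs(delta)) - 1


def p_delta(delta):
    """Probability of a deviation numerically greater than |delta| standard errors."""
    return 1 - q_delta(delta)


for delta in (0.6745, 1.96, 2.58):
    print(f"delta = {delta:<6}  Q = {q_delta(delta):.4f}  P = {p_delta(delta):.4f}")
```

The three values of $\delta$ used in the loop are the multipliers that appear in this section for the probable error and for the 95% and 99% limits.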


Fig. 17. $Q_\delta$ is the probability for a deviation equal to or less than $|\delta|$, 
and $P_\delta$ is the probability for a deviation greater than $|\delta|$. 

On the assumption of a normal distribution we may, in large samples, 
define the probable error of a statistic as 0.6745 times the standard error. 
This depends on the fact that if $\delta = 0.6745$, $Q_\delta = 0.5$, very nearly. The 
probability of a deviation from the true population value numerically at least 
as great as the probable error is therefore $\tfrac{1}{2}$. The term probable error is still in 
fairly common use in some branches of astronomy and physics. 

We can also define confidence limits, based on the normal distribution, 
corresponding to any desired degree of confidence. Thus the probability 
that a sample mean will lie within an interval of 1.96 times the standard error 
on either side of the true mean is about 0.95, the probability that it will lie 
within 2.58 times the standard error on either side is about 0.99, and so on. 

When the true mean is unknown and the sample mean is used as an estimate 
of it, we may form a confidence belt, as illustrated in Figure 18. If we assume 
that the variance $\sigma^2$ is fixed, but that different values of $\mu$ are possible in the 
parent population, we see from the figure that the 95% confidence limits for 
any given value of $\bar{X}$ are $\bar{X} \pm 1.96\sigma/\sqrt{N}$, or approximately $\bar{X} \pm 1.96s/\sqrt{N}$ 
for large N. In this example the belt is straight and of constant width, so 
that even if we assume that all our samples are from one and the same parent 
population, it is still true that the confidence interval calculated as above will 
cover the true population mean in 95% of random samples from this population. 

6.7 Null Hypothesis and Significance Tests. The rationale underlying 
sampling theory has been summarized by E. S. Pearson^ as follows: 

In applying the methods of statistical analysis it is generally our aim to discriminate 
between two or more alternative hypotheses regarding the factors which have controlled 
certain observed events, which form what we term a sample or samples. If the process 
is examined in a little detail it will be found that the procedure may be described as 
follows: 

(a) We define a hypothesis to be tested. 

(b) We choose the criterion (or criteria) whose numerical value, derivable from the 
observations, is most suitable for testing the hypothesis. In doing this we 
recognize that the criterion is not a single-valued expression even if the hypothesis 
be true, but will vary from one sample of observations to another. 

(c) We therefore refer the observed value of the criterion to this sampling distribution, 
e.g., to a normal probability scale, etc., and so obtain a measure of 
the likelihood of the hypothesis. 

(d) Finally, if judged on this probability scale the observed criterion is not exceptional, 
we conclude that upon the information available there are no grounds 
for discarding the hypothesis; or if the value prove exceptional we consider the 
possibility of alternative hypotheses. 

A hypothesis which is tested for possible rejection under the assumption 
that it is true has been called by Fisher ® a null hypothesis. Commonly such 
a hypothesis assumes that the difference between the true value of a parameter 
and some assigned number is zero. In other words, null hypothesis refers to 
a particular form of population distribution which is assumed in considering 
whether or not a sample could reasonably have arisen from this population. 
If the sample could not reasonably have arisen from the population proposed, 
as measured by a significance test, we say that the null hypothesis is refuted 
for the level of significance adopted. If the significance test yields a verdict 
of “not significant” for the probability level adopted, we say that the null 
hypothesis is not refuted or contradicted at that level. 

It is open to the investigator to be more or less exacting concerning the 
smallness of the probability he would require before he would be willing to 
admit that his test has demonstrated a significant result. However, it is 
conventional among certain workers to adopt the following rule: 

If $P_\delta > 0.05$, $\delta$ is not significant; if $0.05 > P_\delta > 0.01$, $\delta$ is significant; 
if $P_\delta < 0.01$, $\delta$ is highly significant. 

This is the rule we shall usually adopt (see § 2.11). 

Other statisticians prefer to describe $\delta$ as definitely significant only when 
$P_\delta < 0.01$, and to regard a value between 0.01 and 0.05 as merely suggesting 
some doubt as to whether $\delta$ is significant or not and calling for additional information. 
Occasionally a still more conservative attitude may be justified. 
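
The conventional rule stated above can be written as a small function. This is only a sketch: the function name `verdict` is hypothetical, and the treatment of the boundary values 0.05 and 0.01 (which the rule leaves unassigned) is an assumption made here, namely that both are counted as "significant".

```python
def verdict(p):
    """Classify a probability P_delta by the conventional 0.05 / 0.01 rule."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p > 0.05:
        return "not significant"
    if p >= 0.01:            # boundary values assigned here by assumption
        return "significant"
    return "highly significant"


for p in (0.20, 0.0466, 0.001):
    print(p, "->", verdict(p))
```

A more exacting investigator would simply replace the two constants, as the paragraph on the more conservative attitude suggests.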

Example 1. Suppose the mean span of 100 persons is found to be $m = 70.56$ inches. 
Does this differ significantly from the mean $\mu = 69.943$ of the "universe" with standard 
deviation $\sigma = 3.115$? Calculating the above test we find 

$\delta = \frac{70.56 - 69.943}{3.115/\sqrt{100}} = 1.99$. 

Referring to the normal probability scale we find the chance of a difference between the 
observed and hypothetical means as large as that noted to be $P_\delta = 0.0466$. Our conclusion 
is that the given statistic $m = 70.56$ is rather exceptional, and the sample quite possibly 
came from a different universe, that is, in this case a different race of men. 

Example 2. Twelve dice were thrown 26,306 times (Weldon's data), and a throw of 5 or 
6 points was reckoned a success. The mean of the observed distribution was found to be 
4.0524. In tossing a true die the chance of scoring 5 or 6 is $\tfrac{1}{3}$, so the number of dice scoring 
5 or 6 should be distributed with frequencies proportional to the terms in the expansion 
$(\tfrac{2}{3} + \tfrac{1}{3})^{12}$. Therefore, the expected mean, on the hypothesis that the dice were true, is 
$sp = 12(\tfrac{1}{3}) = 4$. Test this hypothesis using the difference between the observed and 
theoretical means as a criterion of judgment. 

Solution. $\sigma = (spq)^{1/2} = \{(12)(\tfrac{1}{3})(\tfrac{2}{3})\}^{1/2} = 1.633$ 

$N = 26,306$ 

The probability that a deviation outside $\delta = \pm 5$ would happen by chance is extremely small, 
so we conclude that the dice were biased. 
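
The arithmetic of Example 2 can be retraced as follows. The sketch uses only the figures quoted in the example and applies transformation (6.5); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

s = 12                    # dice per throw
p = 1 / 3                 # chance that a true die shows 5 or 6
q = 1 - p
N = 26306                 # number of throws (Weldon's data)
observed_mean = 4.0524

expected_mean = s * p                 # sp = 4
sigma = sqrt(s * p * q)               # (spq)^(1/2), about 1.633
standard_error = sigma / sqrt(N)      # sigma / sqrt(N)
delta = (observed_mean - expected_mean) / standard_error

# Two-sided probability of a deviation at least this large on the normal scale.
p_value = 2 * (1 - NormalDist().cdf(abs(delta)))

print(f"sigma = {sigma:.3f}, standard error = {standard_error:.5f}")
print(f"delta = {delta:.2f}, P = {p_value:.1e}")
```

The deviation comes out a little over five standard errors, in keeping with the conclusion that the dice were biased.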

Example 3. For 868,445 U. S. Army recruits in World War I, the mean weight was 
141.54 lb, with a standard deviation of 17.36 lb. The standard error of the mean is 
$17.36/(868,445)^{1/2} = .0186$ lb. Regarding these men as a random sample of men of military 
age in the United States in 1917-18, we can say that there is a 95% probability that an 
interval from 141.50 to 141.58 lb would cover the true mean weight of the parent population. 

Samples are seldom as large as this and confidence limits as narrow. The above calcu- 
lation does not take into account errors of weighing arising from various causes. The 
statistician, as such, accepts the data given. The effect of such errors would be to increase 
somewhat the estimate of the standard error. 
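
The interval of Example 3 can be reproduced with a few lines. This sketch uses the figures quoted in the example and the large-sample normal approximation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 868445            # number of recruits
mean_weight = 141.54  # lb
sd_weight = 17.36     # lb

standard_error = sd_weight / sqrt(n)
z = NormalDist().inv_cdf(0.975)        # about 1.96 for 95% confidence
lower = mean_weight - z * standard_error
upper = mean_weight + z * standard_error

print(f"standard error = {standard_error:.4f} lb")
print(f"95% interval   = {lower:.2f} to {upper:.2f} lb")
```

As the example notes, the interval reflects sampling fluctuation only; errors of weighing are not represented in it.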

6.8 The Distribution of the Mean. We have already calculated in § 4.16 
the first four cumulants of the distribution of the mean of a random sample of 
N observations from an infinite parent population, the cumulants of which are 
given. If we denote the moments of the population of means of all possible 
samples of N from the parent population by $M_1, M_2, \cdots$, we 
have in the new notation* 

(6.6)    $M_1 = \mu_1$ 

(6.7)    $M_2 = \mu_2/N$ 

(6.8)    $M_3 = \mu_3/N^2$ 

(6.9)    $M_4 = \mu_4/N^3 + 3(N - 1)\mu_2^2/N^3$ 

Hence the skewness of the population of means is given by 

(6.10)    $G_1 = M_3/M_2^{3/2} = \gamma_1/N^{1/2}$ 

and the excess of kurtosis by 

(6.11)    $G_2 = M_4/M_2^2 - 3 = \gamma_2/N$ 

The skewness and excess both tend to zero as N increases, and in fact for 
moderately large N the distribution of the mean is approximately normal 
for parent populations which depart very widely indeed from the normal form. 
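
Relations (6.6) to (6.9) can be verified exactly for a small discrete universe by listing every possible sample. The universe below (values 0, 1, 4 with probabilities 1/2, 1/3, 1/6) and the sample size N = 3 are hypothetical choices made only for illustration, and the helper names are not the authors'.

```python
from fractions import Fraction
from itertools import product


def central_moments(dist, orders=(2, 3, 4)):
    """Return the mean and the central moments of a {value: probability} distribution."""
    mean = sum(x * p for x, p in dist.items())
    return mean, {r: sum((x - mean) ** r * p for x, p in dist.items()) for r in orders}


def distribution_of_mean(dist, n):
    """Exact distribution of the mean of n independent draws from `dist`."""
    out = {}
    for combo in product(dist.items(), repeat=n):
        value = sum(Fraction(x) for x, _ in combo) / n
        prob = Fraction(1)
        for _, p in combo:
            prob *= p
        out[value] = out.get(value, 0) + prob
    return out


# Hypothetical skew universe, chosen only for illustration.
universe = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}
N = 3

mu1, mu = central_moments(universe)
M1, M = central_moments(distribution_of_mean(universe, N))

print("(6.6)", M1 == mu1)
print("(6.7)", M[2] == mu[2] / N)
print("(6.8)", M[3] == mu[3] / N ** 2)
print("(6.9)", M[4] == mu[4] / N ** 3 + 3 * (N - 1) * mu[2] ** 2 / N ** 3)

# Skewness of the parent and of the means, for comparison with (6.10).
gamma_1 = float(mu[3]) / float(mu[2]) ** 1.5
G_1 = float(M[3]) / float(M[2]) ** 1.5
print(f"gamma_1 / sqrt(N) = {gamma_1 / N ** 0.5:.4f},  G_1 = {G_1:.4f}")
```

Exact fractions are used so that the equalities hold without rounding; the enumeration grows as the Nth power of the number of distinct values, so it is practical only for small cases.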

6.9 An Experiment. We will now describe an exercise in experimental 
sampling which will help make the theory more meaningful. It was performed 
by a class of thirty students who took the distribution of Table 6 as a 
"universe." 

In a box were placed 2000 discs,† each bearing a number from the set 
1, 2, 3, ..., 25. The numbers on the discs were coded to the span values in 
* Strictly, these should be Greek letters, since they refer to a theoretical distribution, but 
it is convenient to use ordinary capitals, and the notation will not often be used in later 
chapters. The corresponding notation for sample moments is $m_1$, $m_2$, but we usually write 
$m$ or $\bar{x}$ for $m_1$ (and $\mu$ for $\mu_1$). 

† Small metal rimmed price tags were used. Ideally, each individual disc should be 
returned to the box before the next is drawn. However, this was not insisted upon and an 
entire sample may have been drawn before replacement. 

     X        f

    58.5      1
    59.5      2
    60.5      1
    61.5      6
    62.5      7
    63.5     22
    64.5     55
    65.5    111
    66.5    146
    67.5    182
    68.5    229
    69.5    265
    70.5    263
    71.5    217
    72.5    176
    73.5    132
    74.5     82
    75.5     48
    76.5     20
    77.5     16
    78.5     12
    79.5      3
    80.5      1
    81.5      2
    82.5      1


accordance with the scheme shown in Table 6a, and the frequency of the
variously numbered discs equaled the frequency of the corresponding X. Each
member of the class drew samples from the box according to the following
directions.


Directions 

1. Intermix the discs thoroughly and withdraw four random samples of 
ten discs each. 

2. Record the numbers in each sample of ten on the sampling record sheet; 
replace the discs in the box. 

3. For each sample of ten: find (a) mean span, (b) variance, (c) standard
deviation.

4. Combine the four samples into a single sample of forty and find the
statistics named in 3.

The results of 3(a) will be reproduced here. There were, of course, 120
means from samples of N = 10. These were then grouped into a frequency
distribution. The resulting distribution and its moments, together with the

Table 6a. Sampling Record Sheet

    Span    Number     First     Second    Third     Fourth
            on Disc    Sample    Sample    Sample    Sample

    58.5       1
    59.5       2
    60.5       3
    61.5       4
    62.5       5
    63.5       6
    64.5       7
    65.5       8
    66.5       9
    67.5      10
    68.5      11
    69.5      12
    70.5      13
    71.5      14
    72.5      15
    73.5      16
    74.5      17
    75.5      18
    76.5      19
    77.5      20
    78.5      21
    79.5      22
    80.5      23
    81.5      24
    82.5      25

    Mean*
    Standard Deviation

* In computing the statistics let x denote span and u the number on a disc. Then $u = x - 57.5$, $\bar{x} = \bar{u} + 57.5$,
and $s_x = s_u$. Note that in this book x and X are both frequently used for variates. The context should prevent
any ambiguity.


moments of the universe, are given in Table 7. (The computations were 
made according to the definitions given in Part I for the moments of an ob- 
served distribution.) 

Although the chief purpose of the experiment is an appreciation of the 
theory, it is of interest to compare the experimental and theoretical results. 
According to (6.6) the mean should be 69.943; we obtained 69.785. According
to (6.7) the standard deviation should be $3.115/(10)^{1/2} = 0.985$; we obtained
0.894.
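
The comparison can be repeated by machine. The following is a minimal sketch, not the class's actual procedure: it assumes the Table 6 frequencies listed earlier in this section, draws each sample with replacement, and uses an arbitrary seed. The names `FREQ`, `SPANS`, `universe_moments`, and `simulate_means` are illustrative only.

```python
import random
from math import sqrt

# Table 6 frequencies for spans 58.5, 59.5, ..., 82.5 (2000 discs in all).
FREQ = [1, 2, 1, 6, 7, 22, 55, 111, 146, 182, 229, 265, 263,
        217, 176, 132, 82, 48, 20, 16, 12, 3, 1, 2, 1]
SPANS = [58.5 + i for i in range(len(FREQ))]

def universe_moments():
    """Mean and standard deviation of the universe (divisor = total frequency)."""
    total = sum(FREQ)
    mean = sum(f * x for f, x in zip(FREQ, SPANS)) / total
    var = sum(f * (x - mean) ** 2 for f, x in zip(FREQ, SPANS)) / total
    return mean, sqrt(var)

def simulate_means(n_samples=120, size=10, seed=1):
    """Means of n_samples samples of `size` discs, each drawn with replacement."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    return [sum(rng.choices(SPANS, weights=FREQ, k=size)) / size
            for _ in range(n_samples)]

mu, sigma = universe_moments()
means = simulate_means()
grand_mean = sum(means) / len(means)
sd_means = sqrt(sum((m - grand_mean) ** 2 for m in means) / len(means))

print(f"universe: mean {mu:.3f}, sigma {sigma:.3f}")
print(f"theory (6.7): s.d. of means {sigma / sqrt(10):.3f}")
print(f"simulation: mean of means {grand_mean:.3f}, s.d. of means {sd_means:.3f}")
```

A different seed gives different simulated values, just as a different class would have obtained figures other than 69.785 and 0.894.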

Table 7. Distribution of the Means of 120 Samples of N = 10 Drawn from the
Universe of Span

    Interval     Mid x    Frequency    Moments

    67.0-67.3    67.15        1        Mean $\bar{x}$ = 69.785
    67.4-67.7    67.55        1        $s_{\bar{x}}$ = 0.8941
    67.8-68.1    67.95        4
    68.2-68.5    68.35        4        $g_{1\cdot\bar{x}}$ = 0.052
    68.6-68.9    68.75        5        $g_{2\cdot\bar{x}}$ = 0.030
    69.0-69.3    69.15       19
    69.4-69.7    69.55       27
    69.8-70.1    69.95       20        $\mu$ = 69.943
    70.2-70.5    70.35       20
    70.6-70.9    70.75        7        $\sigma$ = 3.115
    71.0-71.3    71.15        6        $\gamma_1$ = 0.161
    71.4-71.7    71.55        3
    71.8-72.1    71.95        3        $\gamma_2$ = 0.296



6.10 Standard Errors of Moments. By definition, 

(6.12)  $n_r = \frac{1}{N}\sum_{i=1}^{N} x_i^r$

(6.13)  $m_r = \frac{1}{N}\sum_{i=1}^{N} (x_i - n_1)^r$

If the population rth moment exists,

(6.14)  $E(n_r) = E(x^r) = \nu_r$

Also the variance of $n_r$ is given by

$\mathrm{Var}(n_r) = E(n_r - \nu_r)^2$

$= \frac{1}{N^2} E\left\{ \left(\sum_i x_i^r\right)^2 - 2N\nu_r \sum_i x_i^r + N^2\nu_r^2 \right\}$

$= \frac{\nu_{2r}}{N} - \nu_r^2 + \frac{1}{N^2} \sum_{i \ne j} E(x_i^r x_j^r)$

Since $x_i$ and $x_j$ are assumed to be independent,

$E(x_i^r x_j^r) = E(x_i^r)E(x_j^r) = \nu_r^2$

The number of terms in the sum is $N(N-1)$. Therefore

(6.15)  $\mathrm{Var}(n_r) = \frac{\nu_{2r}}{N} - \nu_r^2 + \frac{N-1}{N}\nu_r^2 = (\nu_{2r} - \nu_r^2)/N$

Thus the variance of $n_2$ is given by

(6.16)  $\mathrm{Var}(n_2) = (\nu_4 - \nu_2^2)/N$

The calculation of the variance of the sample moments about the mean is
more complicated. The standard error of $n_1$ is $(\mu_2/N)^{1/2}$, and since for
moderately large values of N the distribution is nearly normal, almost all the
actual values of $n_1$ will lie within a distance of $\nu_1$ of order $N^{-1/2}$. Hence if we
take the origin at the true mean of the parent population, so that $\nu_1 = 0$,

$E(m_r) = E\left[\frac{1}{N}\sum_i (x_i - n_1)^r\right]$

where $n_1$ is small of order $N^{-1/2}$ and powers of $n_1$ higher than the first can be
neglected. Therefore

$E(m_r) \approx E\left[\frac{1}{N}\sum_i \left(x_i^r - r n_1 x_i^{r-1}\right)\right]$

Now $E(x_i x_j^{r-1}) = E(x_i)E(x_j^{r-1}) = 0$ for $i \ne j$, since $E(x_i) = 0$. Hence, neglecting
terms of order $N^{-1}$,

(6.17)  $E(m_r) \approx \mu_r$

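It may be proved in the same way that for large N

(6.18)  $\mathrm{Var}(m_r) \approx \frac{1}{N}\left[\mu_{2r} - \mu_r^2 + r^2\mu_2\mu_{r-1}^2 - 2r\mu_{r-1}\mu_{r+1}\right]$

(6.19)  $\mathrm{Cov}(m_r, m_s) \approx \frac{1}{N}\left[\mu_{r+s} - \mu_r\mu_s + rs\,\mu_2\mu_{r-1}\mu_{s-1} - r\mu_{r-1}\mu_{s+1} - s\mu_{s-1}\mu_{r+1}\right]$

Thus, for $m_2$,

(6.20)  $\mathrm{Var}(m_2) \approx \frac{1}{N}(\mu_4 - \mu_2^2)$

and, if the parent population is normal,

(6.21)  $\mathrm{Var}(m_2) = 2\sigma^4/N$

Similarly, if the parent population is normal, we find from (6.18) that

(6.22)  $\mathrm{Var}(m_3) \approx 6\sigma^6/N$

(6.23)  $\mathrm{Var}(m_4) \approx 96\sigma^8/N$

noting that $\mu_4 = 3\sigma^4$, $\mu_6 = 15\sigma^6$, and $\mu_8 = 105\sigma^8$, while all the $\mu$'s of odd order
vanish.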
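
The normal-population results can be checked by substituting the normal moments into the standard large-sample formula for the variance of a sample central moment, which is the form taken here for (6.18). This is a small sketch under that assumption; the function names are illustrative, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def normal_central_moment(k, sigma2=1):
    """Central moment mu_k of a normal population with variance sigma2."""
    if k % 2:                      # central moments of odd order vanish
        return 0
    value = 1
    for j in range(1, k, 2):       # (k - 1)!! = 1 * 3 * 5 * ... * (k - 1)
        value *= j
    return value * sigma2 ** (k // 2)

def var_m_r(r, N, mu=normal_central_moment):
    """Large-sample variance of the r-th sample central moment:
    [mu_2r - mu_r^2 + r^2 mu_2 mu_(r-1)^2 - 2 r mu_(r-1) mu_(r+1)] / N."""
    numerator = (mu(2 * r) - mu(r) ** 2
                 + r * r * mu(2) * mu(r - 1) ** 2
                 - 2 * r * mu(r - 1) * mu(r + 1))
    return Fraction(numerator, N)

N = 100  # illustrative sample size, with sigma^2 = 1
for r in (2, 3, 4):
    print(r, var_m_r(r, N))        # 2/N, 6/N and 96/N in lowest terms
```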

Approximate standard errors of functions of these moments, such as
$\sqrt{b_1} = m_3/m_2^{3/2}$ and $b_2 = m_4/m_2^2$, may be calculated by taking differentials
of these functions, and noting that, since the fluctuations of any moment are
of order $N^{-1/2}$, squares and higher powers of the differentials of the moments
can be neglected. Thus,

$\tfrac{1}{2}\,\delta b_1/b_1 = \delta m_3/m_3 - \tfrac{3}{2}\,\delta m_2/m_2$

Squaring and taking expectations we find, for a normal population,

$\frac{1}{4b_1^2}\mathrm{Var}(b_1) = \frac{1}{m_3^2}\mathrm{Var}(m_3) + \frac{9}{4m_2^2}\mathrm{Var}(m_2) - \frac{3}{m_2 m_3}\mathrm{Cov}(m_2, m_3)$

$= \frac{1}{N}\left[\frac{6\sigma^6}{m_3^2} + \frac{9\sigma^4}{2m_2^2}\right]$

Hence

$\mathrm{Var}(b_1) = \frac{4b_1}{N}\left[\frac{6\sigma^6}{m_2^3} + \frac{9\sigma^4 m_3^2}{2m_2^5}\right]$

so that to terms of order $N^{-1}$

(6.24)  $\mathrm{Var}(\sqrt{b_1}) = 6/N$

since to this first approximation we can put $m_3 = 0$, $m_2 = \sigma^2$.
Similarly,

$\delta b_2/b_2 = \delta m_4/m_4 - 2\,\delta m_2/m_2$

whence

$\frac{1}{b_2^2}\mathrm{Var}(b_2) = \frac{1}{m_4^2}\mathrm{Var}(m_4) + \frac{4}{m_2^2}\mathrm{Var}(m_2) - \frac{4}{m_2 m_4}\mathrm{Cov}(m_4, m_2)$

By (6.18) and (6.19) we find for a normal parent population

(6.25)  $\mathrm{Var}(b_2) \approx 24/N$


6.11 Sampling from a Finite Parent Population. Suppose that the parent 
population is of size M and that a random sample of N is drawn. The prob- 
ability of drawing an individual from a given class is affected each time that 
one is drawn from that class. The number of possible different samples is
$\binom{M}{N}$.

In our notation the population mean is given by

$M\mu = \sum_{i=1}^{M} x_i$

the sample mean by

$Nm = \sum_{j=1}^{N} x_j$

and the mean value of m for all possible samples by

$E(m) = \binom{M}{N}^{-1} \sum \frac{1}{N}\sum_{j=1}^{N} x_j$

where the outer sum extends over all possible samples. We note that in this
double sum all possible values of $x_j$ occur with the same frequency, since the
sum includes all possible different samples. Since there are only M possible
values of $x_j$, each value must be repeated $\frac{N}{M}\binom{M}{N}$ times.
We have

$\sum \sum_{j=1}^{N} x_j = \frac{N}{M}\binom{M}{N}\sum_{i=1}^{M} x_i$

so that

$E(m) = \binom{M}{N}^{-1}\cdot\frac{1}{N}\cdot\frac{N}{M}\binom{M}{N}\sum_{i=1}^{M} x_i = \mu$

The mean of all possible different samples is thus equal to the mean of the 
parent population, or in other words the mean of one sample is an unbiased 
estimate of the mean of the parent population. 

By similar algebraic methods it is possible to show that the variance of
the population of means is

(6.27)  $M_2 = \frac{M-N}{N(M-1)}\,\mu_2$

the skewness is

(6.28)  $G_1 = \frac{M-2N}{M-2}\left[\frac{M-1}{N(M-N)}\right]^{1/2}\gamma_1$

and the kurtosis is

(6.29)  $G_2 = \frac{(M-1)(M^2 - 6MN + M + 6N^2)\,\gamma_2 - 6M(MN - M - N^2 + 1)}{N(M-2)(M-3)(M-N)}$

When $M \to \infty$ these last three equations become identical with (6.7),
(6.10), and (6.11) respectively. The moments of the parent population must 
then be defined in terms of probabilities, and it is assumed that these moments 
exist. 
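
For a small population the finite-population results can be verified by listing every possible sample. The sketch below assumes the variance, skewness, and kurtosis formulas in the form written out in its `theory` function (the kurtosis form is the one that reduces to $\gamma_2$ when N = 1); the seven-member population is purely hypothetical.

```python
from itertools import combinations
from math import sqrt

def moments(values):
    """Mean, variance, skewness gamma_1 and excess gamma_2 (divisor = len(values))."""
    n = len(values)
    mean = sum(values) / n
    m2, m3, m4 = (sum((v - mean) ** k for v in values) / n for k in (2, 3, 4))
    return mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

def theory(population, N):
    """Moments of the mean of N drawn without replacement, from the formulas."""
    M = len(population)
    mu, mu2, g1, g2 = moments(population)
    var = mu2 * (M - N) / (N * (M - 1))
    skew = g1 * (M - 2 * N) / (M - 2) * sqrt((M - 1) / (N * (M - N)))
    kurt = ((M - 1) * (M * M - 6 * M * N + M + 6 * N * N) * g2
            - 6 * M * (M * N - M - N * N + 1)) / (N * (M - 2) * (M - 3) * (M - N))
    return mu, var, skew, kurt

def enumerated(population, N):
    """The same moments, found by listing all possible different samples."""
    return moments([sum(s) / N for s in combinations(population, N)])

population = [1, 2, 4, 7, 11, 16, 22]   # hypothetical parent population, M = 7
for a, b in zip(theory(population, 3), enumerated(population, 3)):
    print(f"{a:.6f}  {b:.6f}")
```

Each row of the output should show the formula value and the enumerated value agreeing to rounding error.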

The conclusion of investigators is that the distribution of means from nearly 
any finite universe is practically normal. In this connection the following 
striking example is given by Carver.

A group of students chose arbitrarily the following most unusual distribu- 
tion for a parent universe: 

Table 8

      X        f

      3        2
     15        9
     29       43
    405      189
   1710       37

    Total    280

and found the distribution of $\sum_{1}^{N} x = N\bar{x}$ of 1000 samples of twenty-five
variates each shown in Table 9. It was obtained as follows.


Table 9

    Class        f

    …            2
    …           54
    11,000-    310
    13,000-    254
    15,000-    130
    17,000-     36
    19,000-      9
    21,000-      2

    Total     1000



Two hundred and eighty Hollerith cards were punched with numbers corre- 
sponding to the two hundred and eighty variates of the parent population. 
The cards were thoroughly shuffled and then placed in a tabulating machine. 
After twenty-five cards had run through the electric tabulator their total was 
recorded. By repeating this procedure one thousand samples were readily 
obtained. It is thus possible to obtain experimentally some appreciation of the 
sensitivity of the sampling distribution of means to changes in population 
form. Carver concludes that if the sample N is fifty or larger and the popula- 
tion is at least ten times N, the parent population has relatively little control 
over the shape of the distribution of x. 
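
Carver's card experiment is easy to imitate. The sketch below is an assumption-laden stand-in for the tabulating machine: it treats each sample as 25 cards drawn without replacement from the full deck of 280, uses an arbitrary seed, and compares the simulated totals with the finite-population formula for the variance of a mean. `TABLE_8`, `CARDS`, and `sample_totals` are illustrative names.

```python
import random
from math import sqrt

# Table 8: value -> frequency (280 cards in all).
TABLE_8 = {3: 2, 15: 9, 29: 43, 405: 189, 1710: 37}
CARDS = [x for x, f in TABLE_8.items() for _ in range(f)]

def sample_totals(n_samples=1000, size=25, seed=0):
    """Totals of `size` cards drawn without replacement from the whole deck."""
    rng = random.Random(seed)  # arbitrary seed
    return [sum(rng.sample(CARDS, size)) for _ in range(n_samples)]

M, N = len(CARDS), 25
mu = sum(CARDS) / M
mu2 = sum((x - mu) ** 2 for x in CARDS) / M
# The total is N times the mean, so its s.d. is N times the s.d. of the mean.
sd_total = N * sqrt(mu2 * (M - N) / (N * (M - 1)))

totals = sample_totals()
mean_total = sum(totals) / len(totals)
sd_sim = sqrt(sum((t - mean_total) ** 2 for t in totals) / len(totals))

print(f"expected total {N * mu:.1f}, s.d. {sd_total:.1f}")
print(f"simulated mean total {mean_total:.1f}, s.d. {sd_sim:.1f}")
```

Grouping `totals` into classes of width 2000 gives a frequency table comparable in form to Table 9, though the individual frequencies will differ from Carver's.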

Another set of experiments was conducted by Shewhart, who comes to the
following conclusion:

Such evidence, supported by more rigorous analytical methods beyond the scope
of the present discussion, leads us to believe that in almost all cases in practice we
may establish sampling limits for averages of samples of four or more upon the basis
of normal law theory.

Example 4. Of the 868,445 U. S. Army recruits referred to in Example 3, 27,341 were
from Minnesota. For these men the mean weight was 146.41 lb. Show that, considered
as a random sample from the population of recruits, this one is extremely improbable.

Taking M = 868,445, N = 27,341, $\mu_2 = (17.36)^2$, we find from (6.27) that $M_2 = 0.01068$.

The deviation of the sample mean from the population mean, in standard units, is there-
fore $(146.41 - 141.54)/(0.01068)^{1/2} = 47.1$. On the hypothesis of a normal distribution the
probability of a deviation as high as this is practically zero, so that the Minnesota recruits
certainly did not form a random sample, as regards weight, of the U. S. Army. The large
proportion of men of Scandinavian descent in Minnesota is probably the reason for this.
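
The arithmetic of Example 4 can be reproduced in a few lines. This is a sketch only; it takes the variance of the mean for a finite population in the form $\mu_2 (M - N)/[N(M - 1)]$, and the function name is illustrative.

```python
from math import sqrt

def finite_population_var_of_mean(mu2, M, N):
    """Variance of the mean of N drawn from a finite population of M."""
    return mu2 * (M - N) / (N * (M - 1))

M, N = 868_445, 27_341
M2 = finite_population_var_of_mean(17.36 ** 2, M, N)
z = (146.41 - 141.54) / sqrt(M2)   # deviation in standard units

print(f"M2 = {M2:.5f}")
print(f"deviation in standard units = {z:.1f}")
```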

6.12 Size of Sample to Have a Given Reliability. From § 6.6 we may de-
termine the size N of a sample such that its mean, $\bar{x}$, will not differ from $\mu$ by
more than a specified error $|\delta|$, with a degree of certainty equal to a specified
probability.

Example 5. The American Rolling Mill Company investigated the life of iron alloys
under different corrosive conditions. Data obtained from a certain kind of sheet material
immersed in Washington tap water showed that the average time of failure of such samples
was 874.89 days and the standard deviation of the time of failure was 85.3 days. There
arose the following question of practical interest to the research engineer of this company:
What sample size N must be used in order that for similar test conditions, the probability
shall be 0.90 that the average time for failure determined from the N tests will be in error
by not more than 5 per cent of the average of the universe?

Assuming that 874.89 = $\mu$ and that means of samples of N are distributed normally, we
may answer this question as follows: The allowable error is 5 per cent of 874.89 days or 43.74
days, and this must correspond to a probability of 0.90. From Theorem 6.1 we have

$Q_\delta = 1 - P_\delta = 0.90$

that is

$\int_0^{\delta} \phi(t)\,dt = 0.45$

whence from the tables we find $\delta = 1.645$. Hence N is found by solving the equation

$1.645\,\frac{\sigma}{\sqrt{N}} = 43.74$

where $\sigma = 85.3$. We find N = 10.
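
The same calculation in code, as a hedged sketch: it uses the standard deviation of 85.3 days given in the statement of the example and the standard library's `statistics.NormalDist` in place of the printed tables; `required_sample_size` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, max_error, probability):
    """Solve t * sigma / sqrt(N) = max_error for N (as a real number), where t
    cuts off the central `probability` of the standard normal curve."""
    t = NormalDist().inv_cdf(0.5 + probability / 2)
    return (t * sigma / max_error) ** 2, t

n_exact, t = required_sample_size(sigma=85.3, max_error=0.05 * 874.89,
                                  probability=0.90)
print(f"t = {t:.3f}, N = {n_exact:.2f}")
print("smallest whole N meeting the requirement:", ceil(n_exact))
```

The solution is a little over 10, which the text quotes to the nearest integer; rounding up instead guarantees the stated probability under the normal assumption.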


6.13 Standard Error of an Observed Proportion. If we take samples of
N from a binomial population with parameter $\theta$ the probability of Np successes
is $\binom{N}{Np}\theta^{Np}(1-\theta)^{Nq}$ where $q = 1 - p$. As $N \to \infty$ the binomial distribu-
tion tends to a normal distribution with mean $N\theta$ and variance $N\theta(1-\theta)$.
Hence the standard error of Np is $[N\theta(1-\theta)]^{1/2}$ if $\theta$ is known.

If $\theta$ is not known, we must use an estimate of it. The natural estimate to
take is p, which is the relative frequency of successes in the sample, so that the
standard error of Np is approximately $(Npq)^{1/2}$ for large N. The standard
error of the proportion p is, therefore, $(pq/N)^{1/2}$.

This estimate is unbiased since the expectation of p is the parameter value
$\theta$. Thus

$E(p) = \sum p \binom{N}{Np}\theta^{Np}(1-\theta)^{Nq} = \theta[\theta + (1-\theta)]^{N-1} = \theta$

This estimate is also the one obtained if we inquire what value of $\theta$ is most
likely to give the observed sample, that is, what value of $\theta$, for given p, will
make $f = \binom{N}{Np}\theta^{Np}(1-\theta)^{Nq}$ a maximum. Upon setting $df/d\theta = 0$ we get
$Np(1-\theta) - \theta Nq = 0$ whence $\theta = p$.

The method of estimation just used is known as the method of maximum
likelihood. (See further in § 12.4.)

6.14 Standard Error of a Difference of Proportions. In the analysis of data
obtained by sampling, certain problems occur which relate to the significance
of apparent differences in proportions. Suppose we have two random
samples of size $n_1$ and $n_2$, respectively, with $X_1$ individuals of the $n_1$ items and
$X_2$ of the $n_2$ items which have a certain character or attribute. The question
arises as to whether the observed difference is merely an accident of sampling
or whether a similar difference exists in the universe. The following theorem
may be used to test the null hypothesis that $X_1/n_1$ and $X_2/n_2$ are random and
independent samples from the same universe.


Theorem 6.2. If $X_1/n_1$ and $X_2/n_2$ are the observed proportions in random and
independent samples from infinite populations in which the same proportion $\theta$ of
individuals have the character in question, the probability that the difference in
these observed proportions will be numerically as great as the observed difference
($w = X_1/n_1 - X_2/n_2$) is approximately $P_\delta$, where $P_\delta$ is defined in Theorem 6.1
and $\delta = |w| \left[\theta(1-\theta)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right]^{-1/2}$.

Proof: The expectation of $X_1/n_1$ is $\theta$ and its variance is $\theta(1-\theta)/n_1$. Simi-
larly, the expectation and variance of $X_2/n_2$ are $\theta$ and $\theta(1-\theta)/n_2$. Hence
the expectation of w is zero, and by Theorem 4.11 its variance is

(6.30)  $\sigma_w^2 = \theta(1-\theta)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$

The ratio $w/\sigma_w$, therefore, varies about zero with unit standard deviation.
Information about the form of this distribution may be obtained from its
higher moments. It is not difficult to show (see Problem 11) that

(6.31)  $\gamma_1^2 = \frac{(n_1 - n_2)^2}{n_1 n_2 (n_1 + n_2)} \cdot \frac{1 - 4\theta(1-\theta)}{\theta(1-\theta)}$

(6.32)  $\gamma_2 = \frac{n_1^2 - n_1 n_2 + n_2^2}{n_1 n_2 (n_1 + n_2)} \cdot \frac{1 - 6\theta(1-\theta)}{\theta(1-\theta)}$

For fixed $\theta$ it is clear that $\gamma_1$ and $\gamma_2 \to 0$ as the samples are taken indefinitely
large. Even for moderately small samples the distribution of $w/\sigma_w$ does not
differ greatly from the normal form. The following empirical rule, suggested
by E. S. Pearson, is useful.

Rule. Suppose $n_1 < n_2$ (we are at liberty to call either $n_1$). If $n_1\theta > 5$ the
use of the normal probability scale is justified. If $n_1\theta < 5$, examine $\gamma_1^2$.
If $\gamma_1^2 < 0.04$ the normal law is still sufficiently accurate, but if $\gamma_1^2 > 0.04$ no
great confidence can be placed in the test.

In order to apply this rule, or Theorem 6.2, an estimate of $\theta$ is required.
It is usual to take

(6.33)  $\hat{\theta} = \frac{X_1 + X_2}{n_1 + n_2}$

and it is easy to show that this estimate is unbiased, its expectation being $\theta$.

Example 6. Suppose that in a survey conducted for a company selling a certain brand
of tires, "XX," 750 persons polled in district A said they planned to purchase tires shortly
and 300 of them said they intended to get "XX" brand. In district B, out of 600 persons
planning on purchasing tires, 210 intended to get "XX" brand. Could this difference in
the proportion of prospective "XX" purchasers be attributed to sampling fluctuations or
should the company look for some other explanation?

Our null hypothesis is that the proportion $\theta$ of prospective "XX" purchasers is the same
in A and B. As an estimate of $\theta$ we take $\hat{\theta} = 510/1350 = 0.3778$.

The difference of proportions is 0.40 - 0.35 = 0.05, and the variance of the difference is
$0.3778 \times 0.6222 \times (\frac{1}{750} + \frac{1}{600}) = 0.000705$. Hence $w/\sigma_w = 0.05/0.0266 = 1.88$, and the
probability of a deviation numerically as great as 0.05 is about 0.060. The difference can,
therefore, be attributed to sampling fluctuations, although it is rather close to the borderline
of significance.
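
A short sketch of the test of Example 6 follows, assuming the pooled estimate of $\theta$ and the normal approximation described in this section. The function name is illustrative, and `statistics.NormalDist` stands in for the printed normal tables.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(x1, n1, x2, n2):
    """Pooled estimate, variance of the difference, standardized difference
    and two-sided normal probability for the difference of two proportions."""
    theta_hat = (x1 + x2) / (n1 + n2)                        # pooled estimate
    var_w = theta_hat * (1 - theta_hat) * (1 / n1 + 1 / n2)  # variance of w
    w = x1 / n1 - x2 / n2
    z = w / sqrt(var_w)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return theta_hat, var_w, z, p_two_sided

theta_hat, var_w, z, p = two_proportion_test(300, 750, 210, 600)
print(f"theta_hat = {theta_hat:.4f}, variance = {var_w:.6f}")
print(f"w / sigma_w = {z:.2f}, two-sided probability = {p:.3f}")
```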


6.15 Confidence Limits for the Parameter of a Binomial Distribution. If
in N trials of an event with probability $\theta$ the number of successes is Np,

$E(p) = \theta$

$\mathrm{Var}(p) = \theta(1-\theta)/N$

Since for large values of N and for $\theta$ not very near to 0 or 1 the binomial dis-
tribution is approximately normal, we have

$\Pr\{p - \theta < t[\theta(1-\theta)/N]^{1/2}\} = \Phi(t)$

or

(6.34)  $\Pr\{\theta - t_\alpha[\theta(1-\theta)/N]^{1/2} < p < \theta + t_\alpha[\theta(1-\theta)/N]^{1/2}\} = 1 - \alpha$

where $t_\alpha$ is the value of t for which $\Phi(t_\alpha) = 1 - \alpha/2$, so that $100(1-\alpha)\%$ is
the confidence coefficient corresponding to $t_\alpha$.

In practice, however, $\theta$ is unknown. From (6.34), we have

(6.35)  $\Pr\{|\theta - p| / [\theta(1-\theta)/N]^{1/2} < t_\alpha\} = 1 - \alpha$

At the end-point of this interval

$(\theta - p)^2 = \theta(1-\theta)t_\alpha^2/N$

so that

$N(\theta - p)^2 = \theta(1-\theta)t_\alpha^2$

This is a quadratic in $\theta$, which may be written

$\theta^2(N + t_\alpha^2) - \theta(2pN + t_\alpha^2) + Np^2 = 0$

Solving for $\theta$, we get

(6.36)  $\theta = \frac{Np + \frac{1}{2}t_\alpha^2 \pm t_\alpha(Npq + \frac{1}{4}t_\alpha^2)^{1/2}}{N + t_\alpha^2}$

If these two extreme values are written as $p_{u\alpha}$, $p_{l\alpha}$, corresponding respectively
to the + and - signs in (6.36), (6.35) may be written in the form

(6.37)  $\Pr\{p_{l\alpha} < \theta < p_{u\alpha}\} = 1 - \alpha$

In this form the equation gives confidence limits $p_{l\alpha}$ and $p_{u\alpha}$ for $\theta$. If we
assert that $\theta$ lies between these limits on a large number of occasions (the
limits, of course, varying from sample to sample), we shall be right in a frac-
tion $1 - \alpha$ of these assertions.

Thus if in 400 trials of an event with constant probability $\theta$ of success, we
find 280 successes, our estimate of $\theta$ is 280/400 = 0.7. If we take $t_\alpha = 1.96$,
so that $\alpha = 0.05$, the values of $p_{l\alpha}$ and $p_{u\alpha}$ are 0.698 - 0.045 and 0.698 + 0.045
respectively, so that the 95% confidence limits for $\theta$ are 0.653 and 0.743.
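
The roots of the quadratic can be evaluated directly. The sketch below assumes the solution in the form $[Np + t^2/2 \pm t(Npq + t^2/4)^{1/2}]/(N + t^2)$, which follows from the quadratic formula; `binomial_limits` is an illustrative name.

```python
from math import sqrt

def binomial_limits(successes, N, t=1.96):
    """Large-sample confidence limits for a binomial parameter, obtained as the
    two roots of N(theta - p)^2 = theta(1 - theta) t^2."""
    p = successes / N
    q = 1 - p
    centre = (N * p + t * t / 2) / (N + t * t)
    half_width = t * sqrt(N * p * q + t * t / 4) / (N + t * t)
    return centre - half_width, centre + half_width

lower, upper = binomial_limits(280, 400)
print(f"95% limits: {lower:.3f} and {upper:.3f}")
```

Note that the interval is centred near 0.698 rather than at the estimate 0.7; the roots of the quadratic are not symmetric about p.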

For small values of N, the normal approximation is not justified. However,
it is possible to determine the parameter $\theta_1$ of a binomial such that the ob-
served value of r = Np cuts off the upper 2½% tail of the distribution, that is,

$\sum_{x=r}^{N} \binom{N}{x}\theta_1^x(1-\theta_1)^{N-x} = 0.025$

This can be done by a rather tedious method of successive approximations.*
The value of $\theta_1$ then provides a 95% lower confidence limit for $\theta$. In the same
way, $\theta_2$ can be found so that in the corresponding binomial distribution the
observed Np cuts off the lower 2½% tail. Then $\theta_2$ provides the 95% upper
confidence limit for $\theta$.
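
The successive approximations are no longer tedious when done by machine. The following sketch finds both limits by bisection on the exact binomial tail sums; the values N = 20 and r = 14 are hypothetical, and the function names are illustrative.

```python
from math import comb

def upper_tail(r, N, theta):
    """P(X >= r) for a binomial with parameters N and theta."""
    return sum(comb(N, x) * theta ** x * (1 - theta) ** (N - x)
               for x in range(r, N + 1))

def solve_increasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection for f(theta) = target, where f increases on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_limits(r, N, tail=0.025):
    """Limits theta_1, theta_2 at which the observed r cuts off the given tail."""
    # Lower limit: P(X >= r) = tail; this probability increases with theta.
    theta1 = 0.0 if r == 0 else solve_increasing(
        lambda th: upper_tail(r, N, th), tail)
    # Upper limit: P(X <= r) = tail, i.e. P(X >= r + 1) = 1 - tail.
    theta2 = 1.0 if r == N else solve_increasing(
        lambda th: upper_tail(r + 1, N, th), 1 - tail)
    return theta1, theta2

theta1, theta2 = exact_limits(14, 20)   # hypothetical: 14 successes in 20 trials
print(f"95% limits: {theta1:.3f} and {theta2:.3f}")
```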

Useful charts have been prepared by Clopper and Pearson, by means of
which the 95% and 99% limits for any observed value of p may be read off
approximately. These charts are drawn for several different values of N
between 10 and 1000. The 95% chart is reproduced in S. S. Wilks' Elementary
Statistical Analysis (1948), p. 201.


* More conveniently, we can use the new Tables of the Binomial Probability Distribution 
(National Bureau of Standards, Washington, 1950) which give individual terms and cumu- 
lative sums of terms for values of N from 2 to 49. 

6.16 Sampling from a Finite Binomial Population. For a finite parent
population of size M and a sample of size N, the expectation and variance of
the sample proportion p are given by

(6.38)  $E(p) = \theta$

        $\mathrm{Var}(p) = \frac{M-N}{M-1}\cdot\frac{\theta(1-\theta)}{N}$

For large N and still larger M, p may be regarded as normally distributed
about $\theta$, so that $(p - \theta)\Big/\left[\frac{\theta(1-\theta)}{N}\times\frac{M-N}{M-1}\right]^{1/2}$ is approximately
a standard normal variate. Confidence limits may then be determined as in
§ 6.15.

6.17 Confidence Limits for the Difference of Parameters in Two Binomial
Distributions. Suppose that independent samples of sizes $N_1$ and $N_2$ are taken
from two binomial populations in which the proportions of individuals having
a certain characteristic are $\theta_1$ and $\theta_2$. If the observed proportions are $p_1$ and $p_2$
($p_1 > p_2$), we have

(6.39)  $E(p_1 - p_2) = \theta_1 - \theta_2$

        $\mathrm{Var}(p_1 - p_2) = \frac{\theta_1(1-\theta_1)}{N_1} + \frac{\theta_2(1-\theta_2)}{N_2}$

Assuming as before a normal distribution, we have approximately

$\Pr\{p_1 - p_2 - t_\alpha[\theta_1(1-\theta_1)/N_1 + \theta_2(1-\theta_2)/N_2]^{1/2} < (\theta_1 - \theta_2)$

$\qquad < p_1 - p_2 + t_\alpha[\theta_1(1-\theta_1)/N_1 + \theta_2(1-\theta_2)/N_2]^{1/2}\} = 1 - \alpha$

where, as previously noted, the probability refers to the variables $p_1$ and $p_2$ in
repeated pairs of samples from the same two parent populations.

In this inequality the limits depend on the unknown parameters $\theta_1$ and $\theta_2$.
If we replace these by their estimates $p_1$ and $p_2$, respectively, we have approxi-
mate confidence limits,

(6.40)  $p_{l\alpha} = p_1 - p_2 - t_\alpha(p_1q_1/N_1 + p_2q_2/N_2)^{1/2}$

        $p_{u\alpha} = p_1 - p_2 + t_\alpha(p_1q_1/N_1 + p_2q_2/N_2)^{1/2}$

where $q_1 = 1 - p_1$, $q_2 = 1 - p_2$.


Example 7. Out of 400 voters polled in constituency A, 160 stated that they intended to
vote Liberal in a forthcoming election. Out of 500 polled in constituency B, 250 stated that
they would vote Liberal.

Here $p_1 = 0.5$, $p_2 = 0.4$, $N_1 = 500$, $N_2 = 400$, and

$\left(\frac{p_1q_1}{N_1} + \frac{p_2q_2}{N_2}\right)^{1/2} = 0.0332$

Taking $t_\alpha = 1.96$ we have, as 95% confidence limits, $p_{l\alpha} = 0.035$, $p_{u\alpha} = 0.165$. The difference
of the proportion of Liberal voters in the two constituencies may be expected to lie between
these limits, assuming that the voters polled form random samples of their respective con-
stituencies, and that they really vote as they say they will.
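
The limits of Example 7 can be recomputed as follows. This is a minimal sketch of the approximate limits for a difference of proportions, with the larger proportion (constituency B) taken first; `difference_limits` is an illustrative name.

```python
from math import sqrt

def difference_limits(x1, n1, x2, n2, t=1.96):
    """Approximate confidence limits for theta_1 - theta_2, with the unknown
    parameters replaced by the observed proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return se, d - t * se, d + t * se

# Constituency B (250 of 500) is taken first so that p1 > p2.
se, lower, upper = difference_limits(250, 500, 160, 400)
print(f"standard error {se:.4f}; 95% limits {lower:.3f} and {upper:.3f}")
```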

6.18 Confidence Limits for the Poisson Distribution. Variables distributed
according to the Poisson law arise in several physical and biological problems.
Thus, if organisms of a certain species are distributed at random over the bot-
tom of a lake, the numbers found in a series of trial dredgings from separate
small areas of the same size will follow this law. The frequency function of
the Poisson distribution contains a single parameter m which is the expecta-
tion of the variable X. For any given m, values $X_1$ and $X_2$ can be calculated
such that the probability that X lies below $X_2$ or above $X_1$ is, say, 0.025, and
upper and lower confidence curves can be constructed. Since the Poisson
distribution is discrete, these curves are stepped curves. Thus, for any given
m, there will, in general, be one value of $X_1$ such that

$\sum_{X=X_1}^{\infty} e^{-m}m^X/X! < 0.025$  and  $\sum_{X=X_1-1}^{\infty} e^{-m}m^X/X! > 0.025$,

but as m increases through a value such that
$\sum_{X=X_1}^{\infty} e^{-m}m^X/X! = 0.025$, the corresponding $X_1$ increases by one unit. The
curves in Figure 19 are drawn just inside the calculated step curves. We see
from these curves that if X = 21, m lies between 13 and 32, so that if, for ex-
ample, we count 21 organisms in a square yard of a lake bottom, we can
state with 95% confidence that the average number of organisms per square
yard lies between 13 and 32.
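
The limits read from the curves can also be found numerically. The sketch below is an assumption about method, not the construction used for Figure 19: it solves for the two values of m at which the observed count cuts off a 2½% tail of the Poisson distribution, using bisection. The function names are illustrative.

```python
from math import exp

def poisson_cdf(k, m):
    """P(X <= k) for a Poisson variable with expectation m."""
    if k < 0:
        return 0.0
    term = total = exp(-m)
    for x in range(1, k + 1):
        term *= m / x
        total += term
    return total

def root_of_decreasing(f, lo, hi, iters=80):
    """Bisection for f(m) = 0, where f decreases on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_limits(x, tail=0.025):
    """Values of m for which the observed x cuts off the upper or lower tail."""
    bound = 5 * x + 20                      # generous upper search bound
    # Lower limit: P(X >= x) = tail, i.e. P(X <= x - 1) = 1 - tail.
    lower = 0.0 if x == 0 else root_of_decreasing(
        lambda m: poisson_cdf(x - 1, m) - (1 - tail), 0.0, bound)
    # Upper limit: P(X <= x) = tail.
    upper = root_of_decreasing(lambda m: poisson_cdf(x, m) - tail, 0.0, bound)
    return lower, upper

lower, upper = poisson_limits(21)
print(f"95% limits for m when X = 21: {lower:.1f} and {upper:.1f}")
```

For X = 21 the computed limits agree with the values 13 and 32 read from the curves.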

6.19 Picking a Random Sample. A random sample was defined in § 6.1 
as a sub-set of a set of variates in which each individual variate from the 
parent population has an equal chance to be included. This is, however, more 
of an abstract than a practical definition and gives no indication of how to 
obtain a random sample or to test a given sample for randomness. These 
matters are of very great importance for experimental statisticians, and, 
although this book is primarily concerned with mathematical statistics, some
remarks on the subject may be worth while. 

In the first place, we note that the use of the theory of probability to make 
inferences about a parent population from observations on a sample (for 
example, to test some hypothesis about the population or to calculate con- 
fidence limits for a parameter) is based on the essential assumption that the 
sample is random. A purposive sample, which is one selected according to 
one or more definite criteria in the mind of the experimenter and which depends 
on his skill and judgment in making the selection, may be thought to be 
more "typical" of the parent population than a random sample, but the
reliability of inferences about the parent population cannot be assessed from 
such a sample. 

If the parent population is finite, the individuals may be numbered and
arranged in order, and a sample taken by choosing, say, every kth individual
starting with the ith, i being some positive number from 1 to k chosen at ran-
dom. This is called systematic sampling. If the attribute under discussion
is clearly independent of the ordering (as, for instance, if a group of students
are arranged in alphabetical order of their surnames and a sample is wanted
of their performance on an intelligence test), the sample is practically equiva-
lent to a random sample. Care must be taken that there is no natural peri-
odicity of period k in the parent population. Thus if every tenth house along
a street is selected, and if there are ten houses to a block, it may happen that
every selected house is a corner house. Such a sample of households would
probably not be representative of the population in respect of any attribute 
depending on economic status. If there is some serial correlation between 
the individuals of the population (as arranged in order), that is, a correlation 
between, say, the ith and the (i + j)th individuals, the variance of a systematic 
sample may be greater than, equal to, or less than the variance of a random 
sample, depending on the extent of this correlation (see § 6.22). 
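
Systematic sampling, and the danger of periodicity, can be illustrated in a few lines. This is a sketch only: the street of 100 houses with ten to a block is hypothetical, and `systematic_sample` is an illustrative name.

```python
import random

def systematic_sample(population, k, rng=None):
    """Every k-th individual, starting with the i-th, i chosen at random from 1..k."""
    rng = rng or random.Random()
    start = rng.randint(1, k)
    return population[start - 1::k]

houses = list(range(1, 101))          # hypothetical street, houses numbered 1..100
sample = systematic_sample(houses, 10, random.Random(3))
print(sample)

# With ten houses to a block, every selected house occupies the same position
# within its block, so the sample is either all corner houses or none.
positions = {house % 10 for house in sample}
print("positions within block:", positions)
```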

Various mechanical devices may be used to obtain random samples from a 
finite population, ranging from "drawing numbers out of a hat" to the elabo-
rate revolving drums filled with metal cylinders each containing a number, as
used in large lotteries. A simple method is to write the numbers on cards
which are then thoroughly shuffled. A selection of some of these cards will
give an approximately random sample of the numbers. All such devices are
subject to possible mechanical imperfections which may introduce a bias.
Even the best-made roulette wheels will show a slight tendency to favor some 
compartments above others. Cards will differ slightly in their tendency to 
stick together, and shuffling by the customary procedures is notoriously im- 
perfect. 

Large tables of numbers, such as telephone directories for big cities or sets 
of statistical tables, may be used to get random numbers. The table is opened 
haphazardly and the right-hand digits, or pairs of right-hand digits, are 
written down in order, any numbers of less than four digits being ignored. 
It has been found in practice, however, that this method is not satisfactory 
— the frequencies of the different digits often differ from their expected values 
by larger amounts than can reasonably be attributed to sampling fluctuations. 
Mathematical tables, such as seven-figure logarithms, should not be used in 
this way because of the obvious relationship between successive entries — the 
first differences tend to be constant over many entries. The 15th to the 19th 
digits in A. J. Thompson’s 20-figure tables of logarithms (Logarithmetica
Britannica) have been used, however, by Fisher and Yates in obtaining random
numbers. Since the first differences are themselves 15-digit numbers, very 
little systematic effect is to be expected. Moreover, additional randomness 
was introduced by selecting pages of the table, choosing a particular column 
(between the 15th and 19th), and choosing the position of the first digit in a 
block, all these choices being made by the use of two packs of playing cards. 
Even so, Fisher and Yates found an undue preponderance of 6’s, which they 
corrected by picking out some of the 6’s at random and replacing them by 
other digits, also chosen at random.

There are now available several sets of numbers which have been carefully 
compiled and tested in various ways for randomness, and one of these sets is 
commonly used when it is necessary to take a random sample from a finite 
population. These are: (a) Tippett’s Random Sampling Numbers, consist-
ing of 41,600 digits taken from census reports and combined into fours to
make 10,400 four-figure numbers; (b) Kendall and Babington Smith’s more
extensive set, consisting of 100,000 digits grouped in 2’s and 4’s, and in
separate thousands: these were obtained by means of a specially constructed
machine; (c) Fisher and Yates’s numbers, 15,000 digits arranged in 2’s given in
their Statistical Tables (Table XXXIII) and obtained in the way described
above.

Example 8. To draw a sample of 30 from the 1000 children of Table 2 (§ 5.9). 

The individuals are supposed numbered in order of increasing weight, number 1 in the 
group 28-31 lb, numbers 2 to 15 in the group 32-35 lb, and so on. Since 1000 goes into 
10,000 ten times, it is convenient to multiply the frequencies by 10 and allot numbers 0000
to 0009 to the first group, 0010 to 0149 to the second group, and so on, numbers 9970 to 9999 
belonging to the tenth group. Opening a table of random numbers (say Fisher and Yates) 
and starting anywhere, we read off consecutive digits in groups of four, thus: 0564, 1270, 
8880, 5835, 0688, 7348, • • • until 30 numbers have been accumulated. The number 0564 
belongs to group 3, 1270 to group 4, 8880 to group 7, and so on. For a sample obtained in
this way, the mean was 47.63 lb, as compared with 47.712 lb for the population.

If the total frequency in a distribution is, say, 759 instead of 1000, we can multiply class
frequencies by 12 and neglect any random numbers equal to or greater than 9108. 

Example 9. To arrange 16 numbered objects (such as plots of land in a block) in random 
order. 

A method of doing this is to take the random two-figure numbers as they occur, starting 
anywhere, divide each by 16 and use the remainders (counting 0 as 16). It is rather easier 
to divide by 20 and reject 0, 17, 18, and 19. Whenever a number is repeated in the sequence,
it must be rejected. This, however, means that toward the end of the process most numbers
will have to be rejected. Alternatively, after choosing the first 7 numbers as above, we can
divide the next two-figure random numbers that occur by 9, 8, 7, ... and use the remainders
to indicate which of the numbers from 1 to 16 still remaining unchosen is to be selected.
Thus, the random numbers may be 

53, 81, 29, 13, 39, 35, 01, 20, 71, 34, 62, • • • 

These give as remainders 

13, 1, 9, 13, 19, 15, 1, 0, 11, 14, 2

so that our first 7 numbers are 13, 1, 9, 15, 11, 14, 2. 

The next 8 random numbers are 

33, 74, 82, 14, 53, 73, 19, 09 
which give as remainders on dividing by 9, 8, 7, ..., 2,

6, 2, 5, 2, 3, 1, 1, 1 

The nine unchosen numbers in order are 

3, 4, 5, 6, 7, 8, 10, 12, 16 

of which the 6th is 8, the 2nd of those remaining is 4, the 5th is 10, and so on. The last 
9 numbers are therefore 

8, 4, 10, 5, 7, 3, 6, 12, 16 
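
The two-stage procedure of Example 9 can be traced in code. The sketch below uses the random numbers quoted in the example; treating a zero remainder in the second stage as the divisor itself is an assumption made here (the case does not arise with these numbers), and `random_order` is an illustrative name.

```python
def random_order(first_stage, second_stage, n=16, modulus=20, switch_after=7):
    """Arrange 1..n in random order from two-figure random numbers."""
    chosen = []
    # Stage 1: remainders on division by 20, rejecting 0, 17, 18, 19 and repeats.
    for number in first_stage:
        r = number % modulus
        if 1 <= r <= n and r not in chosen:
            chosen.append(r)
        if len(chosen) == switch_after:
            break
    # Stage 2: divide by the count of numbers still unchosen (9, 8, 7, ...) and
    # take the one at that position among those remaining, in order.
    remaining = [x for x in range(1, n + 1) if x not in chosen]
    for number in second_stage:
        if len(remaining) == 1:
            break
        r = number % len(remaining) or len(remaining)   # assumption: 0 -> divisor
        chosen.append(remaining.pop(r - 1))
    chosen.extend(remaining)                            # the last one is forced
    return chosen

order = random_order([53, 81, 29, 13, 39, 35, 1, 20, 71, 34, 62],
                     [33, 74, 82, 14, 53, 73, 19, 9])
print(order)
```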

Example 10. To draw a random sample from a normal population with mean 20 and 
standard deviation 4. 

Here the parent population is infinite, but the proportion of individuals occurring in any 
specified interval can be found from a table of the normal law. We may choose intervals 
4-, 6-, 8-, 10-, 12-, • • • which correspond to intervals for the standard normal curve be- 
ginning at -4.0, -3.5, -3, -2.5, -2, • • • . The areas of the normal curve up to these points, 
rounded off to four figures, are 0.0000, 0.0002, 0.0013, 0.0062, 0.0228, 0.0668, 0.1587, 0.3085, 
0.5000, 0.6915, 0.8413, 0.9332, 0.9772, 0.9938, 0.9987, 0.9998. Hence in a table of four-digit 
random numbers, we can allot the numbers 0000-0001 to the first interval (with center 5), 
the numbers 0002-0012 to the second interval (with center 7), and so on.
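
The allotment of four-digit numbers to intervals amounts to a table lookup. The following sketch uses the areas quoted in Example 10 (multiplied by 10,000) and returns the center of the interval to which a random number belongs; `CUM`, `CENTRES`, and `interval_centre` are illustrative names, and the six numbers in the usage line are those read off in Example 8.

```python
from bisect import bisect_right

# Areas of the normal curve up to -4.0, -3.5, ..., 3.5 (from the text) times
# 10,000, followed by 10,000 to close the last interval.
CUM = [0, 2, 13, 62, 228, 668, 1587, 3085, 5000, 6915, 8413,
       9332, 9772, 9938, 9987, 9998, 10000]
CENTRES = [5 + 2 * i for i in range(16)]   # intervals 4-, 6-, ..., 34- have centers 5, 7, ..., 35

def interval_centre(number):
    """Center of the interval allotted to a four-digit random number 0000-9999."""
    return CENTRES[bisect_right(CUM, number) - 1]

sample = [interval_centre(n) for n in (564, 1270, 8880, 5835, 688, 7348)]
print(sample)
```

The values returned are interval centers, so the resulting sample is grouped to the nearest two units of the variate.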

6.20 Tests for Randomness. Any finite set of numbers, however improb- 
able, could arise by random sampling in a sufficiently prolonged series of trials, 
but a very improbable set would be unsatisfactory, by itself, as a basis for a
randomizing experiment. In the table of Kendall and Babington Smith 
referred to above, there are 100 groups of 1000 single-digit numbers each, 
and 5 of these groups are starred to indicate that they did not satisfy certain 
tests of randomness. These groups should not be used for sampling experi- 
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ments involving less than 1000 digits, but can safely be used in conjunction 
with at least four neighboring groups. 

Four tests of randomness were used: 

1. Frequency test. The frequency of occurrence of each digit 0 to 9 was 
compared with the expected value (one-tenth of the number of digits in the 
group). 

2. Serial test. The frequencies of all pairs 00-99 were compared with the 
expected values. 

3. Poker test. The numbers were grouped in fours, and the frequencies of 
various combinations of digits (four of the same kind, three of one kind and 
one different, two pairs, one pair, and all different) were compared with ex- 
pectation. 

4. Gap test. The lengths of gaps between successive zeros were counted 
and a frequency distribution formed for comparison with expectation. 

In all cases the comparison was made by the $\chi^2$ test, with rejection levels at 
1% and 99%. The gap test was not suitable for groups as small as 1000, but 
was used for blocks of 5000, 25,000, and the whole 100,000 digits. 
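
As an illustration of the first and third tests, the sketch below computes the $\chi^2$ statistic for the frequency test and classifies a group of four digits for the poker test. The function names are hypothetical, and the comparison of the statistic with tabulated $\chi^2$ values is left to the reader.

```python
from collections import Counter


def frequency_chi_square(digits):
    """Chi-square statistic comparing digit counts with n/10 each."""
    n = len(digits)
    expected = n / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(10))


def poker_class(hand):
    """Classify a group of four digits by its pattern of repeats."""
    if len(hand) != 4:
        raise ValueError("the poker test described here uses groups of four")
    pattern = tuple(sorted(Counter(hand).values(), reverse=True))
    names = {
        (4,): "four of a kind",
        (3, 1): "three of a kind",
        (2, 2): "two pairs",
        (2, 1, 1): "one pair",
        (1, 1, 1, 1): "all different",
    }
    return names[pattern]


print(frequency_chi_square(list(range(10)) * 100))  # 0.0 for a perfectly even group
print(poker_class([1, 2, 1, 3]))                    # one pair
```

For the frequency test the statistic has 9 degrees of freedom, since the ten counts are subject to one linear restriction (their total is fixed).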

6.21 Stratified Sampling. Suppose the population, of size $N$, is divided 
into sub-populations (strata) of sizes $N_1, N_2, \ldots, N_k$. Let $n_1, n_2, \ldots, n_k$ be the 
sizes of random samples from the respective strata and let $n = \sum n_i$. Then 
if, for any variable $x$, the sample weighted mean and the population mean 
are $\bar{x}_n$ and $\bar{x}_p$ respectively, we shall show that, with weights $N_i$, 

(6.41) $E(\bar{x}_n) = \bar{x}_p$ 

and 

(6.42) $\operatorname{Var}(\bar{x}_n) = \frac{1}{N^2} \sum_i N_i^2 \, \frac{N_i - n_i}{N_i - 1} \, \frac{\sigma_i^2}{n_i}$ 

where $\sigma_i^2$ is the population variance in the $i$th stratum. If a heterogeneous 
population is divided into relatively homogeneous strata, the variance of the 
mean may be reduced by sampling in this way. Also there may be compelling 
administrative reasons for stratification, as, for instance, by counties within 
a state or province. 
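
A short sketch of the weighted mean and of the variance formula (6.42), $\operatorname{Var}(\bar{x}_n) = \frac{1}{N^2}\sum N_i^2 \frac{N_i - n_i}{N_i - 1}\frac{\sigma_i^2}{n_i}$, may help fix the notation. The function names and the numerical values are illustrative assumptions, not data from the text.

```python
from fractions import Fraction


def stratified_mean(stratum_sizes, sample_means):
    """Weighted mean of the stratum sample means, with weights N_i."""
    N = sum(stratum_sizes)
    return sum(Ni * m for Ni, m in zip(stratum_sizes, sample_means)) / N


def stratified_variance(stratum_sizes, sample_sizes, stratum_variances):
    """Variance of the weighted mean according to (6.42)."""
    N = sum(stratum_sizes)
    total = sum(
        Fraction(Ni ** 2 * (Ni - ni), (Ni - 1) * ni) * s2
        for Ni, ni, s2 in zip(stratum_sizes, sample_sizes, stratum_variances)
    )
    return total / N ** 2


# Hypothetical population: two strata of sizes 6 and 4.
print(stratified_mean([6, 4], [10, 20]))             # 14.0
print(stratified_variance([6, 4], [3, 2], [1, 4]))   # 107/375
```

With a single stratum the expression reduces to the finite-population variance of a sample mean, and it vanishes when every stratum is sampled completely.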

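If $\bar{x}_{ni}$ and $\bar{x}_{pi}$ represent sample and population means within the $i$th stratum, 

$\bar{x}_n = \frac{1}{N}\sum_i N_i \bar{x}_{ni}, \qquad \bar{x}_p = \frac{1}{N}\sum_i N_i \bar{x}_{pi}$ 

and since $E(\bar{x}_{ni}) = \bar{x}_{pi}$ by (6.26), the truth of (6.41) is established. Moreover, 

$\operatorname{Var}(\bar{x}_n) = E(\bar{x}_n - \bar{x}_p)^2 = \frac{1}{N^2}\left[\sum_i N_i^2 E(\bar{x}_{ni} - \bar{x}_{pi})^2 + E(\text{cross-product terms})\right]$ 

The expectation of the cross-product terms will be zero, since the sample from 
one stratum is independent of the sample from another stratum. By (6.27) 
the variance in the $i$th stratum is given by 

$E(\bar{x}_{ni} - \bar{x}_{pi})^2 = \frac{N_i - n_i}{N_i - 1}\,\frac{\sigma_i^2}{n_i}$ 

so that 

$\operatorname{Var}(\bar{x}_n) = \frac{1}{N^2}\sum_i N_i^2\,\frac{N_i - n_i}{N_i - 1}\,\frac{\sigma_i^2}{n_i}$ 

which is (6.42). 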

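It will be proved in the next chapter (§ 7.2) that an unbiased estimate of $\sigma_i^2$, 
based on the sample of $n_i$, is $n_i s_i^2/(n_i - 1)$, where $s_i^2$ is the observed variance 
for this sample. Hence 

(6.43) $\operatorname{Var}(\bar{x}_n) = \frac{1}{N^2}\sum_i N_i^2\,\frac{N_i - n_i}{N_i - 1}\,\frac{s_i^2}{n_i - 1}$ 
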
It has been shown by J. Neyman that in stratified sampling the variance 
of the mean is least (and therefore the mean is most accurately estimated) if, 
for a fixed total sample size, the $n_i$ are proportional to $N_i\sigma_i$. If this condition 
is satisfied, and if the cost of taking the sample is proportional to the sample 
size, the sampling is optimum, in the sense that the greatest possible accuracy 
is obtained for a fixed cost. 

To prove this, we have to minimize $\operatorname{Var}(\bar{x}_n)$, as given by (6.42), with the 
added condition $\sum n_i$ = constant. Using the Lagrange multiplier $\lambda$, we have 
to minimize, without restriction, the quantity 

$\operatorname{Var}(\bar{x}_n) + \lambda \sum_i n_i$ 

Differentiating with respect to $n_i$, we obtain 

$-\frac{N_i^3 \sigma_i^2}{N^2 (N_i - 1)\, n_i^2} + \lambda = 0$ 

so that 

$n_i = \frac{N_i \sigma_i}{N \lambda^{1/2}}$ approx. 

if we ignore the difference between $N_i$ and $N_i - 1$. Since $\sum n_i = n$, we have 

$n = \frac{1}{N\lambda^{1/2}} \sum N_i \sigma_i$ 

so that 

(6.44) $n_i = n\,\frac{N_i \sigma_i}{\sum N_i \sigma_i}$ 

The application of this result requires a knowledge of $\sigma_i$ which is not usually 
available before the sample is taken. However, it is often possible to obtain 
a useful estimate from general knowledge of the population, from a preliminary 
survey, or from previous experience. 
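
The optimum allocation, in which the $n_i$ are proportional to $N_i\sigma_i$, is easily computed once estimates of the $\sigma_i$ are in hand. The sketch below is illustrative only: the function names and figures are assumptions, and the rounding of the $n_i$ to whole numbers is left to the user.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample of n in proportion to N_i * sigma_i."""
    weights = [Ni * si for Ni, si in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


def proportional_allocation(n, stratum_sizes):
    """Split a total sample of n in proportion to N_i alone."""
    N = sum(stratum_sizes)
    return [n * Ni / N for Ni in stratum_sizes]


# Hypothetical strata of equal size but unequal variability.
print(neyman_allocation(100, [500, 500], [1, 3]))  # [25.0, 75.0]
print(proportional_allocation(100, [600, 400]))    # [60.0, 40.0]
```

When all the standard deviations are equal the two allocations coincide, as the formula shows.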
If the $\sigma_i$ are all equal, $n_i = nN_i/N$, so that the $n_i$ are proportional to the 
corresponding $N_i$. The sampling is then said to be proportional. In this 
case, 

$\bar{x}_n = \frac{1}{N}\sum_i N_i \bar{x}_{ni} = \frac{1}{n}\sum_i n_i \bar{x}_{ni}$ 

and so is equal to the sample total divided by the sample size. The sample 
is self-weighting. 

6.22 Systematic Sampling. If there are $N = nk$ units in the population, 
all numbered consecutively, and if we take a unit at random from the first $k$ 
units and every $k$th unit subsequently, we shall obtain a systematic sample 
of $n$. Thus if $k = 20$, and the first unit drawn is the 7th, the subsequent 
units in the sample are the 27th, 47th, etc. 

This method of sampling is convenient and rapid. If the individual data 
are on cards, all of the same size, arranged in a file drawer, a ruler may be laid 
alongside and a card drawn out at, say, each inch without the cards being 
numbered. This is not perhaps strictly "every $k$th" sampling but is very 
speedy. 

In effect, systematic sampling divides the population into strata, each con- 
sisting of k successive units, and chooses one sampling unit per stratum. This 
unit is not, however, chosen at random but occupies the same relative position 
in each stratum. Since the systematic sample is spread evenly over the 
population, it often gives a very accurate estimate of the mean. The theory 
was worked out by W. G. and L. H. Madow. 

Let $i$ be a random number between 1 and $k$ inclusive. The sample mean is 

$m_i = [x_i + x_{i+k} + \cdots + x_{i+(n-1)k}]/n$ 

Since there are $k$ values of $i$, all equally likely, 

(6.45) $E(m_i) = (m_1 + m_2 + \cdots + m_k)/k = \bar{x}_p$ 

where $\bar{x}_p$ is the mean value of $x$ for the whole population. The sample mean 
is, therefore, an unbiased estimate of the population mean. The variance is 
given by 

$\operatorname{Var}(m_i) = E(m_i - \bar{x}_p)^2 = \frac{1}{k}\sum_{i=1}^{k}(m_i - \bar{x}_p)^2 = \frac{1}{n^2 k}\sum_{i=1}^{k}(n m_i - n\bar{x}_p)^2$ 

where 

$n m_i - n\bar{x}_p = (x_i - \bar{x}_p) + (x_{i+k} - \bar{x}_p) + \cdots + (x_{i+(n-1)k} - \bar{x}_p)$ 

On squaring and adding, the squared terms sum to $\sum_{i=1}^{N}(x_i - \bar{x}_p)^2 = N\sigma^2$. As 
for the cross-product terms, there are $k(n - 1)$ terms like $2(x_i - \bar{x}_p)(x_{i+k} - \bar{x}_p)$ 
which come from observations $k$ units apart; $k(n - 2)$ terms like 
$2(x_i - \bar{x}_p)(x_{i+2k} - \bar{x}_p)$ which come from observations $2k$ units apart, and so on. Hence 

(6.46) $\operatorname{Var}(m_i) = \frac{1}{n^2 k}\left[N\sigma^2 + 2\sum_{j=1}^{n-1}\sum_{i=1}^{k(n-j)}(x_i - \bar{x}_p)(x_{i+jk} - \bar{x}_p)\right]$ 

Now let $\rho_{jk}$ be the non-circular serial correlation coefficient for a lag of $jk$ 
defined by 

(6.47) $k(n - j)\sigma^2\rho_{jk} = \sum_{i=1}^{k(n-j)}(x_i - \bar{x}_p)(x_{i+jk} - \bar{x}_p)$ 

Then, from (6.46), 

(6.48) $\operatorname{Var}(m_i) = \left[N\sigma^2 + 2k\sigma^2\sum_{j=1}^{n-1}(n - j)\rho_{jk}\right]/n^2 k = \frac{\sigma^2}{n}\left[1 + \frac{2}{n}\sum_{j=1}^{n-1}(n - j)\rho_{jk}\right]$ 

This may be compared with the result for a random sample, 

$\operatorname{Var}(\bar{x}_n) = \frac{\sigma^2}{n}\cdot\frac{N - n}{N - 1} = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)$ very nearly 

if $N$ is large compared with 1. It follows that systematic sampling will be 
more accurate than random sampling if the serial correlation coefficients are 
negative and sufficiently large. If the serial correlation coefficients are all 
zero, random sampling is the more accurate. 
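
For a small population the $k$ possible systematic samples can simply be listed, which gives a direct check on the unbiasedness of $m_i$ and on the variance expression $\frac{1}{n^2 k}\left[N\sigma^2 + 2\sum_j\sum_i (x_i - \bar{x}_p)(x_{i+jk} - \bar{x}_p)\right]$ of (6.46). The sketch below uses exact fractions; the function names and the population (the integers 1 to 12, a steady trend) are assumptions chosen for illustration.

```python
from fractions import Fraction


def systematic_means(x, k):
    """Means of the k possible systematic samples (N = n*k assumed)."""
    n = len(x) // k
    return [Fraction(sum(x[i + j * k] for j in range(n)), n) for i in range(k)]


def variance_by_enumeration(x, k):
    """Var(m_i) obtained by averaging over the k equally likely samples."""
    xp = Fraction(sum(x), len(x))
    return sum((m - xp) ** 2 for m in systematic_means(x, k)) / k


def variance_by_formula(x, k):
    """Var(m_i) from the squared and cross-product terms, as in (6.46)."""
    N = len(x)
    n = N // k
    xp = Fraction(sum(x), N)
    d = [xi - xp for xi in x]
    squares = sum(di * di for di in d)            # equals N * sigma^2
    cross = sum(d[i] * d[i + j * k]
                for j in range(1, n)
                for i in range(k * (n - j)))
    return (squares + 2 * cross) / (n * n * k)


def random_sample_variance(x, n):
    """Variance of the mean of a random sample of n without replacement."""
    N = len(x)
    xp = Fraction(sum(x), N)
    sigma2 = sum((xi - xp) ** 2 for xi in x) / N
    return sigma2 / n * Fraction(N - n, N - 1)


x = list(range(1, 13))   # hypothetical population with a steady trend
k = 4
print(systematic_means(x, k))          # the means 5, 6, 7, 8 (as Fractions)
print(variance_by_enumeration(x, k))   # 5/4
print(variance_by_formula(x, k))       # 5/4
print(random_sample_variance(x, 3))    # 13/4
```

In this example the systematic sample is considerably more accurate than a random sample of the same size, because units $k$ or $2k$ apart in a trending population give negative cross-product sums on balance.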


Problems 


1. Suppose a variable $w$ is normally distributed and a value is selected at random. Show 
that the odds are about 369 to 1 against the value differing from $E(w)$ by more than $3\sigma_w$. 

2. (a) Consider a finite universe of 5 variates: $x_1, x_2, x_3, x_4, x_5$. The number of distinct 
samples of 3 variates each that may be drawn is $\binom{5}{3} = 10$. Write these down. 

(b) Let $\bar{x}_i$ represent the $i$th sample mean and write down the 10 distinct sample means. 
For example, 

$\bar{x}_1 = \frac{x_1 + x_2 + x_3}{3}$ 

(c) Show that the mean of the 10 values of $\bar{x}_i$ is the mean of the 5 values of $x_i$. Thus, 

$\frac{1}{10}\sum_{i=1}^{10}\bar{x}_i = \frac{1}{5}\sum_{i=1}^{5}x_i$ 

What formula does this example illustrate? 

3. Show that the expected value of $w^2$ is greater than the square of the expected value of $w$. 

4. From a box containing 2000 discs representing the distribution of span, draw a sample 
of 25 and compute its mean and standard deviation. Test the significance of the difference 
between your mean and the mean of the universe p — 69.9 i3 inches. 

5. Suppose the weights of a sample of 1000 men of the same age are obtained yielding 
$\bar{x} = 140$ lb. Assuming that $\sigma = 20.0$ lb, what is the standard error of the mean of this 
sample? What is the probability that this mean does not differ from the mean of the 
universe at this age by more than five pounds? 

6. (Camp) The mean age of death of men who are alive at age 20 is, in the United 
States, 59.13. For the city of Chicago it is 58.98, and in 1910 the male population of age 
20 was 24,000. Can the difference between the United States and Chicago be explained on 
the hypothesis of chance? Assume $\sigma = 10$ years, and that the distribution of the universe 
is approximately normal. 

7. (Camp) A fraternal organization wishes to be very sure that the average age of death 
in its group of men now aged 20 will not differ from the expected 59.13 years by more than 
one year. By "very sure" it means that the probability must equal .999 or more. How large 
should the group be? (Assume as before that $\sigma = 10$.) 

8. Given that 

k 

w ^ 'y(f^ -f xo 

k k 

If the x^8 are independent and ^ /t is a constant, show that 

1 1 

where at® represents the variance of Xi, 

9. Find the mean value of all positive ordinates of the first quadrant of $x^2 + y^2 = r^2$, 

(a) when equally spaced along the x-axis, 

(b) when equally spaced along the circle. 

Answers 

(a) $\frac{1}{r}\int_0^r y\,dx = \frac{1}{r}\int_0^r \sqrt{r^2 - x^2}\,dx = \frac{\pi r}{4}$ 



10 . Find the mean value of all the ordinates of the curve 2 / = a + from 0 to a;, when 
equally spaced along the x-axis. 

11 . Derive (6.31) and (6.32). Hint, pr == E(wl<rwy == B(w'')/aw''. 

12. Show that the moment relations (6.27), (6.28), (6.29) reduce to the corresponding 
relations (6.7), (6.10), (6.11) if $N \to \infty$. 

13. Suppose 300 mice having cancer of about the same degree of malignancy were divided 
at random into two groups of $n_1 = 100$ and $n_2 = 200$, respectively. The first group was 
given a certain serum treatment which was withheld from the second group but otherwise 
the two groups were treated alike. Among the serum-treated there were $x_1 = 8$ deaths, and 
among the other group there were $x_2 = 25$ deaths. Test the significance of the difference 
between the mortality of 8% and 12½% in the two groups. 

14. An instructor had two classes of 20 and 30 students in the same subject. Four in 
the smaller class and 8 in the larger made grades of B or better. Should one seek a further 
explanation of this difference beyond variation due to sampling? 

15. For a sample of 345 eleven-year-old boys, the mean weight was found to be 74.71 lb 
and the standard deviation 10.65 lb. Calculate 98% confidence limits for the mean weight. ' 
Ans. 73.37 lb and 76.05 lb. 

16. In certain states of the United States in a certain year the number of white males 
dying between the ages of 30 and 31 was 1609 out of a total of 253,445 white males in this 
age-group. For Negro males of the same age the number dying was 115 out of 6975. Does 
there seem to be a significant racial difference in the death rates? 

17. A railway company has experimented with two processes A and B for creosoting its 
ties. Of 50 ties preserved by process A, 22 are still in service after 23 years. Of 50 
preserved by process B, 18 are still in service. Both sets were subjected to practically identical 
conditions. Is the difference significant? 

18. An aquarium contains a culture of a certain organism suspended in water. The 
water is well stirred and 0.1 ml is examined on a slide. The number of organisms counted 
is 13. Estimate from Figure 19 the 95% confidence limits for the number of organisms per 
milliliter of the water. 

19. Use a table of random numbers to draw four random samples of 10 from the popula- 
tion specified in Table 6. Calculate the mean for each sample and compare with the distri- 
bution given in Table 7. 

20. If the cost per unit of sampling in a stratified sample varies from stratum to stratum, 
so that the total cost is $c = \sum c_i n_i$, prove that for fixed $c$ the minimum variance of the 
mean is given when $n_i$ is proportional to $N_i\sigma_i/c_i^{1/2}$. That is, more sampling units should be 
picked from a given stratum if it is (a) larger, (b) more variable, (c) cheaper to sample. 
CHAPTER VII 


SMALL OR EXACT SAMPLING THEORY 

7.1 Introduction. A theory of sampling which assumes that $N$ is large is 
inadequate for many practical problems. In recent years a theory has been 
developed to give more exact methods in dealing with small samples. In the 
practical field, the call for the solution of problems based on comparatively 
few observations was first realized in 1908 by a young man,* then unknown, 
who published his results under the now celebrated pseudonym of "Student." 
Since then, many important contributions have been made toward the 
development and extension of this theory. Its applications are widespread. In the 
opinion of the present writers, continuity between large and small sample 
theory is an essential part of the newer attitude. In general, the methods of 
small sample theory are applicable to large samples, although the reverse is 
not true. It is our purpose in this chapter to facilitate an appreciation of some 
of the simpler aspects of this theory. The treatment centers around 
significance tests for means, variances, and other statistics. 

7.2 Expected Value of $s^2$. By definition, the variance of a sample is given 
by 

(7.1) $s^2 = \frac{1}{N}(x_1^2 + x_2^2 + \cdots + x_N^2) - \bar{x}^2$ 

Then the expected value of $s^2$ from repeated samples is 

(7.2) $E(s^2) = \frac{1}{N}E(x_1^2 + x_2^2 + \cdots + x_N^2) - E(\bar{x}^2)$ 

Since the $x_i$'s constitute a sample we may write 

(7.3) $E(x_1^2 + x_2^2 + \cdots + x_N^2) = N E(x^2)$ 

It will do no harm, and simplify the algebra considerably, if we assume that 
the origin is so chosen that the mean of the parent population is zero. Then 
$\nu_1 = 0$, and all the $\mu_i$ are identical with the corresponding $\nu_i$. We have, 
therefore, 

(7.4) $E(x_i) = 0$ 

Moreover, since any two observations $x_i$ and $x_j$ may be regarded as 
independent random variates with the same distribution (that of the parent 
population), 

(7.5) $E(x_i x_j) = E(x_i)E(x_j) = 0$ 

and similarly for any expression which contains as a factor any of the $x_i$ 
raised to the first power. 

* William Sealy Gosset (1876-1937). See reference 9. 

From (7.4), 

(7.6) $E(\bar{x}) = \sum E(x_i)/N = 0$ 

Also, $E(\bar{x}^2) = E(\sum x_i)^2/N^2$. Now $(\sum x_i)^2$ contains $N$ terms like $x_i^2$ and 
$N(N - 1)$ terms like $x_i x_j$, where $i \neq j$. Therefore 

$E(\bar{x}^2) = [N E(x_i^2) + N(N - 1)E(x_i x_j)]/N^2$ 

But $E(x_i^2) = \mu_2$, and hence by (7.5), 

(7.7) $E(\bar{x}^2) = \mu_2/N$ 

We have, therefore, 

(7.8) $E(s^2) = E(x^2) - E(\bar{x}^2) = \mu_2 - \mu_2/N = \frac{N - 1}{N}\mu_2$ 


This result is sometimes stated as in the following theorem. 


Theorem 7.1 The mean of the sampling distribution of $s^2$ from an arbitrary 
universe equals the variance of the universe multiplied by the factor* $(N - 1)/N$. 
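
Because the theorem holds for an arbitrary universe, it can be checked exactly on a small discrete one by listing every possible sample. The sketch below assumes a universe whose values are equally likely and sampled with replacement; the function names and the values used are illustrative only.

```python
from fractions import Fraction
from itertools import product


def variance(values):
    """Variance with divisor N (the definition of s^2 used in the text)."""
    N = len(values)
    mean = Fraction(sum(values), N)
    return sum((v - mean) ** 2 for v in values) / N


def expected_s2(universe, N):
    """Exact mean of s^2 over all ordered samples of N, drawn with
    replacement from equally likely values of the universe."""
    samples = list(product(universe, repeat=N))
    return sum(variance(s) for s in samples) / len(samples)


universe = [0, 1, 2]              # hypothetical universe, variance 2/3
print(variance(universe))         # 2/3
print(expected_s2(universe, 2))   # 1/3, i.e. (N - 1)/N times 2/3
print(expected_s2(universe, 3))   # 4/9
```

The same check succeeds for any other small set of values, symmetric or not.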

It is to be anticipated that the expected value of $s^2$ is less than $\sigma^2$, as the 
following analysis will show. The variance $\sigma^2$ refers to deviations from $\nu_1$, 
whereas any $s^2$ refers to deviations from an $\bar{x}$. For any sample, then, we may 
regard $\nu_1$ as an arbitrary origin. Since, in the case of any sample, the sum of 
the squares of deviations from its mean, $\bar{x}$, is less than the sum of the squares 
of deviations of the same variates from an arbitrary point $\nu_1$ (unless the sample 
is one whose mean falls at $\nu_1$), it is to be expected that the mean of all the values 
of $s^2$ will be less than $\sigma^2$. Relation (7.8) measures the extent of this inequality. 

It follows from (7.8) that $E\{Ns^2/(N - 1)\} = \sigma^2$, so that $Ns^2/(N - 1)$ is an 
unbiased estimate of $\sigma^2$ based on a single sample. Since $s^2$ is defined as 
$[\sum(x_i - \bar{x})^2]/N$, the unbiased estimate is $\sum(x_i - \bar{x})^2/(N - 1)$. This latter 
quantity is, in fact, frequently used (e.g., by R. A. Fisher and by S. S. Wilks) 
as a definition of the sample variance and denoted by the same symbol $s^2$. 
With this definition, of course, $s^2$ is an unbiased estimate of $\sigma^2$. In consulting 
references the student should note which definition of $s^2$ the author uses. It 
is conventional to take 

(7.9) $\hat{\sigma} = s\left(\frac{N}{N - 1}\right)^{1/2}$ 

as an estimate of $\sigma$. If $N$ is large the difference between unity and the 
coefficient of $s$ is negligible in numerical problems. With $N$ large it would 
not be invalid, to any appreciable extent, to use $s$ as an estimate of $\sigma$. 

* This factor is sometimes called "Bessel's correction." Perhaps it should be attributed 
more appropriately to Gauss who made use of it, in this connection, as early as 1823. 

If two independent samples are available from the same universe, an 
unbiased estimate of $\sigma^2$ based on the two samples is given by 

(7.10) $\hat{\sigma}^2 = \frac{q}{N - 2}$ 

where 

$q = N_1 s_1^2 + N_2 s_2^2, \qquad N = N_1 + N_2$ 

$s_1^2$ and $s_2^2$ being the variances of samples consisting of $N_1$ and $N_2$ variates, 
respectively. It is left as an exercise for the student to verify that the 
expected value of $q/(N - 2)$ is $\sigma^2$. 

In case $k$ independent samples are available from the same universe, we 
may generalize (7.10) and write 

(7.11) $\hat{\sigma}^2 = \frac{Q}{U - k}$ 

where 

$Q = N_1 s_1^2 + N_2 s_2^2 + \cdots + N_k s_k^2$ 
$U = N_1 + N_2 + \cdots + N_k$ 

and $s_i^2$ is the variance in the $i$th sample consisting of $N_i$ variates. When $\hat{\sigma}^2$ 
is used in future discussions it will be clear from the context whether this 
estimate is based on 1, 2, or $k$ samples. 

If $N_i = N$ is the same for every sample, (7.11) reduces to 

(7.12) $\hat{\sigma}^2 = \frac{N(s_1^2 + s_2^2 + \cdots + s_k^2)}{U - k}$ 

where $U = Nk$. Clearly, (7.12) may be written in the form 

(7.13) $\hat{\sigma}^2 = \frac{N}{k(N - 1)}(s_1^2 + s_2^2 + s_3^2 + \cdots + s_k^2)$ 

7.3 Degrees of Freedom. In § 7.2 we have proved, essentially, that the 
expected value of $\sum(x_i - \bar{x})^2$ is $(N - 1)\sigma^2$, where the $N$ values of $x$ in the 
sample are subject to the linear restriction $\sum x_i = N\bar{x}$. This is equivalent 
to proving that the expected value of $\sum x_i^2$ is $(N - 1)\sigma^2$ when the $x_i$'s are 
subject to the linear restriction $\sum x_i = 0$. Suppose, however, that there are 
$k < N$ linear restrictions on the $x_i$'s. What, then, is the expected value of 
$\sum x_i^2$? A. T. Craig² has proved analytically that if $x_1, x_2, \ldots, x_N$ are $N$ 
independent values of a variable which is normally distributed about zero with 
variance $\sigma^2$, and if the $N$ values of $x$ are subject to $k < N$ homogeneous linear 
restrictions, then the expected value of $\sum x_i^2$ is $(N - k)\sigma^2$. The number 
$n = N - k$ is frequently called the number of degrees of freedom. 

The concept of degrees of freedom was introduced in § 5.3, in connection 
with the chi-square distribution, and will recur frequently in this and later 
chapters. It is important to get a clear idea of its meaning. 

A point constrained to move along a fixed curve has one degree of freedom 
of movement: it can move backward and forward only. If it is free to 
move anywhere in a plane or on a fixed surface, such as the surface of a sphere, 
it has two degrees of freedom, and if it is free to move anywhere in ordinary 
space it has three degrees of freedom. We cannot visualize more than three 
dimensions of space, but it is often convenient to use the geometrical language 
of $N$ dimensions. As will be described more fully in § 7.7, we can represent 
a sample of $N$ observations of a variable $x$ by the sample point $S$ in 
$N$-dimensional space, with coordinates $x_1, x_2, \ldots, x_N$. Different samples of $N$ will 
correspond to different points $S$, and if no restrictions are imposed, $S$ has $N$ 
degrees of freedom. This is the case, for instance, when the sample is from 
an infinite normal population with known mean and variance. 

All samples with the same mean $\bar{x}$ satisfy the condition $\sum x_i = N\bar{x}$, 
which is the equation of a hyperplane in $N$-dimensional space. The 
hyperplane itself is a space of $N - 1$ dimensions. If we do not know the 
population mean but have to estimate it from one sample, we take $\bar{x}$ as this estimate, 
and then we are, in effect, considering our sample as belonging to the particular 
class of samples whose representative points $S$ lie in the hyperplane $\sum x_i = N\bar{x}$. 
In other words, the sample has only $N - 1$ degrees of freedom for the 
estimation of $\sigma^2$. 

If $\sigma^2$ is estimated from the sample as $Ns^2/(N - 1) = \sum(x_i - \bar{x})^2/(N - 1)$, 
the sample point is taken as lying on the surface of a sphere, $\sum(x_i - \bar{x})^2 = Ns^2$, 
of center $(\bar{x}, \bar{x}, \ldots, \bar{x})$ and radius $N^{1/2}s$. This sphere intersects the 
hyperplane in a sphere of dimensionality less by one, so that $S$ has now only 
$N - 2$ degrees of freedom. 

In Chapter V, $\chi^2$ was used to test the goodness of fit of a normal curve to a 
given distribution, and the number of degrees of freedom was taken as $k - 3$, 
$k$ being the number of classes in the distribution. There the degrees of freedom 
refer to the different ways in which a sample might be spread over a fixed set 
of class-intervals. The total size of the sample is fixed, so that only $k - 1$ of 
the $k$ classes can be filled arbitrarily. Moreover, the mean and variance of 
the parent population are estimated from the sample, so that the degrees of 
freedom are further reduced by 2. 

Various other uses of the "degrees of freedom" concept will appear later. 
A good elementary discussion may be found in an article by Helen Walker in 
Journal of Educational Psychology, 31, 1940, pp. 253-269. 

7.4 Standard Error of the Variance. By the method of § 7.2 we can find 
the exact variance and other moments of $s^2$, although the algebra becomes 
heavy for the higher moments. 
Thus, 

(7.14) $\operatorname{Var}(s^2) = E(s^4) - [E(s^2)]^2$ 

where $E(s^2)$ is given by (7.8). Also 

(7.15) $s^4 = \left(\sum x_i^2/N - \bar{x}^2\right)^2 = (\sum x_i^2)^2/N^2 - 2\sum x_i^2(\sum x_i)^2/N^3 + (\sum x_i)^4/N^4$ 

Now 

$(\sum x_i^2)^2 = \sum x_i^4 + \sum x_i^2 x_j^2$ 

where the first sum contains $N$ terms and the second $N(N - 1)$. Since 
$E(x_i^4) = \mu_4$, and $x_i$ and $x_j$ are independent, we get 

(7.16) $E(\sum x_i^2)^2 = N\mu_4 + N(N - 1)\mu_2^2$ 

Again, 

$\sum x_i^2(\sum x_i)^2 = \sum x_i^4 + \sum x_i^3 x_j + \sum x_i^2 x_j^2 + \sum x_i^2 x_j x_k$ 

The expectations of the second and fourth sums on the right-hand side vanish. 
The first sum contains $N$ terms and the third contains $N(N - 1)$ terms. 
Therefore 

(7.17) $E[\sum x_i^2(\sum x_i)^2] = N\mu_4 + N(N - 1)\mu_2^2$ 

Finally, 

$(\sum x_i)^4 = \sum x_i^4 + \sum x_i^3 x_j + \sum x_i^2 x_j^2 + \sum x_i^2 x_j x_k + \sum x_i x_j x_k x_l$ 

The expectations of the second, fourth, and fifth sums vanish. The first 
contains $N$ terms and the third $3N(N - 1)$ terms. (There are $N(N - 1)/2$ 
ways of picking out $i$ and $j$, and the multinomial coefficient of $x_i^2 x_j^2$ is 
$4!/2!\,2! = 6$.) Hence 

(7.18) $E(\sum x_i)^4 = N\mu_4 + 3N(N - 1)\mu_2^2$ 

and, from (7.15), on collecting terms, 

(7.19) $E(s^4) = \mu_4(N - 1)^2/N^3 + \mu_2^2(N - 1)(N^2 - 2N + 3)/N^3$ 

Then, from (7.14) and (7.8), 

(7.20) $\operatorname{Var}(s^2) = \mu_4(N - 1)^2/N^3 - \mu_2^2(N - 1)(N - 3)/N^3 = \frac{N - 1}{N^3}\left[(N - 1)\mu_4 - (N - 3)\mu_2^2\right]$ 

This is an exact result, true for any parent population for which the fourth 
moment exists. If the parent population is normal, $\mu_4 = 3\mu_2^2$ and 

(7.21) $\operatorname{Var}(s^2) = 2\mu_2^2(N - 1)/N^2 = 2\sigma^4(N - 1)/N^2$ 

Hence the standard error (s.e.) of the variance is 

$2^{1/2}(N - 1)^{1/2}\sigma^2/N$ 

or, if $\sigma^2$ is estimated by $Ns^2/(N - 1)$, 

(7.22) s.e. of $s^2 = s^2[2/(N - 1)]^{1/2}$ 
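
Since (7.20), $\operatorname{Var}(s^2) = \frac{N-1}{N^3}[(N-1)\mu_4 - (N-3)\mu_2^2]$, holds for any parent population with a finite fourth moment, it can be checked exactly by enumerating all samples from a small discrete universe. The sketch below assumes equally likely values sampled with replacement; the function names and the universe chosen are illustrative only. The last function evaluates (7.22).

```python
from fractions import Fraction
from itertools import product


def central_moment(values, r):
    """r-th moment about the mean of equally likely values."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** r for v in values) / len(values)


def var_s2_formula(mu2, mu4, N):
    """Var(s^2) according to (7.20)."""
    return Fraction(N - 1, N ** 3) * ((N - 1) * mu4 - (N - 3) * mu2 ** 2)


def var_s2_enumerated(universe, N):
    """Var(s^2) found by listing every ordered sample of N."""
    s2_values = [central_moment(sample, 2)
                 for sample in product(universe, repeat=N)]
    m1 = sum(s2_values) / len(s2_values)
    m2 = sum(v * v for v in s2_values) / len(s2_values)
    return m2 - m1 ** 2


def se_of_variance(s2, N):
    """Standard error of s^2 for a normal parent, as in (7.22)."""
    return s2 * (2 / (N - 1)) ** 0.5


universe = [-1, 0, 1]   # hypothetical universe: mu_2 = 2/3, mu_4 = 2/3
mu2 = central_moment(universe, 2)
mu4 = central_moment(universe, 4)
print(var_s2_formula(mu2, mu4, 2))      # 5/36
print(var_s2_enumerated(universe, 2))   # 5/36
print(se_of_variance(4.0, 9))           # 2.0
```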

7.5 The Distribution of $s^2$ in Samples from a Normal Population. By 
similar methods to those used in § 7.4, "Student" calculated higher moments of 
the distribution of $s^2$ and found that the skewness and kurtosis were given 
respectively by 

(7.23) $\gamma_1 = [8/(N - 1)]^{1/2}, \qquad \gamma_2 = 12/(N - 1)$ 

so that $3\gamma_1^2 - 2\gamma_2 = 0$. This relationship is characteristic of a Pearson 
Type III distribution, and "Student" conjectured that the true frequency 
function of the distribution of $s^2$ is 

(7.24) $f(s^2) = C(s^2)^{(N-3)/2}e^{-Ns^2/2\sigma^2}$ 

This was later proved by R. A. Fisher.* We shall now derive the sampling 
distributions of both the mean and the variance in samples of any size $N$ 
from a normal population of variance $\sigma^2$. As before, we will assume that 
$\nu_1 = 0$. 

The joint frequency function for a sample consisting of the $N$ independent 
normal random variates $x_1, x_2, \ldots, x_N$ is 

(7.25) $f(x_1, x_2, \ldots, x_N) = (2\pi\sigma^2)^{-N/2}e^{-\sum x_i^2/2\sigma^2}$ 

where 

$\sum x_i^2 = \sum(x_i - \bar{x} + \bar{x})^2 = Ns^2 + N\bar{x}^2$ 

since $\sum(x_i - \bar{x}) = 0$. Hence $f(x_1, x_2, \ldots, x_N)$ is a function of the sample 
mean $\bar{x}$ and the sample variance $s^2$. In order to find the distributions of $\bar{x}$ and 
$s^2$ separately, we require to change the variables from $x_1, x_2, \ldots, x_N$ to a new 
set, of which two may be taken as $\bar{x}$ and $s$. This can be done either 
analytically or geometrically. Both methods are typical of many proofs in 
mathematical statistics, and therefore both will be presented here. 

7.6 The Analytical Approach. This method is direct, but involves 
considerable algebra. It depends on an extension to $N$ variables of the theorem 
on change of variables given in § 3.2. 

We change from the set $x_1, x_2, \ldots, x_N$ to the set $\bar{x}, s, w_1, w_2, \ldots, w_{N-2}$, the 
$w_1, w_2, \ldots, w_{N-2}$ being chosen in any convenient way which ensures that 
$\sum(x_i - \bar{x}) = 0$ and $\sum(x_i - \bar{x})^2 = Ns^2$. We then have 

(7.26) $dx_1\,dx_2 \cdots dx_N = |J|\,d\bar{x}\,ds\,dw_1 \cdots dw_{N-2}$ 

where 

(7.27) $J = J\left(\frac{x_1, x_2, x_3, \ldots, x_N}{\bar{x}, s, w_1, \ldots, w_{N-2}}\right)$ 

a determinant of $N$ rows and columns. 

* The distribution of $s^2$ was actually obtained by Helmert in 1876, but this work was for 
a long time overlooked (see ref. 9). 

The frequency function in the new variables is then 

(7.28) $f_1(\bar{x}, s, w_1, \ldots, w_{N-2}) = |J|\,f(\bar{x}, s)$ 

where $f(\bar{x}, s)$ is the right-hand side of (7.25), which is a function of the variables 
$\bar{x}$ and $s$ alone. By integrating (7.28) over the whole range of values of 
$w_1, w_2, \ldots, w_{N-2}$, we obtain the joint distribution of $\bar{x}$ and $s$. By integrating 
once more over one of these variables we obtain the frequency function of the 
other. The relations between the two sets of variables may be taken as 

(7.29) 
$x_1 = \bar{x} + sN^{1/2}w_0 w_1 w_2 \cdots w_{N-3}w_{N-2}$ 
$x_2 = \bar{x} + sN^{1/2}w_0 w_1 w_2 \cdots w_{N-3}(1 - w_{N-2}^2)^{1/2}$ 
$x_3 = \bar{x} + sN^{1/2}w_0 w_1 w_2 \cdots (1 - w_{N-3}^2)^{1/2}$ 
$\cdots$ 
$x_{N-1} = \bar{x} + sN^{1/2}w_0(1 - w_1^2)^{1/2}$ 
$x_N = \bar{x} + sN^{1/2}(1 - w_0^2)^{1/2}$ 

where we have temporarily introduced a new variable $w_0$ which can be 
expressed in terms of the remaining $w$'s. 

From the first two equations of (7.29), we obtain 

$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 = s^2 N w_0^2 w_1^2 \cdots w_{N-3}^2$ 

Adding $(x_3 - \bar{x})^2$, we get $s^2 N w_0^2 w_1^2 \cdots w_{N-4}^2$, and so on. Finally we obtain 

$\sum(x_i - \bar{x})^2 = Ns^2$ 

thus satisfying the second condition mentioned above. 

To satisfy the other condition, $\sum(x_i - \bar{x}) = 0$, we must obviously have 

(7.30) $w_0[w_1 w_2 \cdots w_{N-2} + w_1 w_2 \cdots w_{N-3}(1 - w_{N-2}^2)^{1/2} + \cdots + (1 - w_1^2)^{1/2}] + (1 - w_0^2)^{1/2} = 0$ 

which expresses $w_0$ in terms of the other $w$'s. 
By differentiating each of the equations in (7.29) partially with respect to 
$\bar{x}, s, w_1, w_2, \ldots, w_{N-2}$ in turn, we see that the first column of the Jacobian $J$ 
consists of all 1's, the second column has a common factor $N^{1/2}$, and all the 
remaining columns have a common factor $sN^{1/2}$. Taking out these factors we 
get 

(7.31) $J = N^{(N-1)/2}s^{N-2}D$ 

where $D$ is a determinant depending only on the $w$'s. The frequency function 
is, therefore, from (7.28) and (7.25), 

(7.32) $f_1(\bar{x}, s, w_1, \ldots, w_{N-2}) = C_1|D|\,s^{N-2}e^{-N(\bar{x}^2 + s^2)/2\sigma^2}$ 

where $C_1$ is a constant depending on $N$ and $\sigma$. Integrating over $w_1, w_2, \ldots, w_{N-2}$, 
we get 

(7.33) $f_2(\bar{x}, s) = C_2\,s^{N-2}e^{-N(\bar{x}^2 + s^2)/2\sigma^2}$ 

Note that in obtaining (7.33) we do not need actually to carry out the 
integration. The integral of $|D|$ is a constant and is absorbed in the constant $C_2$, the 
numerical value of $C_2$ remaining for the present undetermined. 

Since $f_2(\bar{x}, s)$ can be split into two parts, one depending only on $\bar{x}$ and the 
other only on $s$, it is clear that the distributions of $\bar{x}$ and of $s$ are independent. 
The distribution of $\bar{x}$ is given by 

(7.34) $f_3(\bar{x}) = C_3\,e^{-N\bar{x}^2/2\sigma^2}$ 

which is a normal distribution with variance $\sigma^2/N$. The constant $C_3$ is 
determined by $\int f_3(\bar{x})\,d\bar{x} = 1$, whence $C_3 = (N/2\pi\sigma^2)^{1/2}$. 

The distribution of $s$ is given by 

(7.35) $f_4(s) = C_4\,s^{N-2}e^{-Ns^2/2\sigma^2}$ 

Since 

$f_5(s^2)\,d(s^2) = f_4(s)\,ds$ 

where $f_5(s^2)$ is the frequency function for $s^2$, we have 

(7.36) $f_5(s^2) = f_4(s)/2s = C_5(s^2)^{(N-3)/2}e^{-Ns^2/2\sigma^2}$ 

This represents a curve belonging to Pearson Type III. The constant $C_5$ is 
determined from $\int f_5(s^2)\,d(s^2) = 1$, and is found to be (see Example 2, § 3.5) 

(7.37) $C_5 = \left(\frac{N}{2\sigma^2}\right)^{(N-1)/2}\Big/\,\Gamma\!\left(\frac{N - 1}{2}\right)$ 

It may be noted that the distribution of $Ns^2/\sigma^2$ is precisely that of $\chi^2$ 
with $N - 1$ degrees of freedom. Putting $\chi^2 = Ns^2/\sigma^2$ in (7.36) we obtain 

(7.38) $f(\chi^2)\,d(\chi^2) = \frac{1}{\Gamma[(N - 1)/2]}\left(\frac{\chi^2}{2}\right)^{(N-3)/2}e^{-\chi^2/2}\,d\!\left(\frac{\chi^2}{2}\right)$ 

The sum of squares of $N$ independent normal variates measured from a fixed 
origin is distributed as $\chi^2\sigma^2$ with $N$ degrees of freedom, whereas if the variates 
are measured from the sample mean the sum of squares is distributed as 
$\chi^2\sigma^2$ with $N - 1$ degrees of freedom. As noted in § 7.3, the variates when 
measured from their mean are subject to a single linear constraint. 
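
The frequency function (7.36), $f_5(s^2) = C_5(s^2)^{(N-3)/2}e^{-Ns^2/2\sigma^2}$ with $C_5 = (N/2\sigma^2)^{(N-1)/2}/\Gamma[(N-1)/2]$, can be checked numerically: its total area should be 1 and its mean should be $(N - 1)\sigma^2/N$. The sketch below uses a plain midpoint rule; the function names, the upper limit of integration, and the choice $N = 5$, $\sigma^2 = 1$ are assumptions made for illustration.

```python
import math


def f5(u, N, sigma2):
    """Frequency function of u = s^2 for samples of N from a normal
    population of variance sigma2, as in (7.36) and (7.37)."""
    a = N / (2 * sigma2)
    c5 = a ** ((N - 1) / 2) / math.gamma((N - 1) / 2)
    return c5 * u ** ((N - 3) / 2) * math.exp(-a * u)


def midpoint_integral(g, upper, steps=100_000):
    """Midpoint-rule approximation to the integral of g from 0 to upper."""
    h = upper / steps
    return h * sum(g((i + 0.5) * h) for i in range(steps))


N, sigma2 = 5, 1.0
area = midpoint_integral(lambda u: f5(u, N, sigma2), 40.0)
mean = midpoint_integral(lambda u: u * f5(u, N, sigma2), 40.0)
print(round(area, 6))   # 1.0
print(round(mean, 6))   # 0.8, i.e. (N - 1) * sigma2 / N
```

The upper limit 40 is far enough out that the neglected tail is negligible for these values; for $N = 2$ the integrand is unbounded at the origin and a simple midpoint rule would converge slowly.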

7.7 The Geometrical Derivation of the Sampling Distribution of Variance. 
This method (the one used by Fisher) is concise and illuminating. The 
chief drawback is that geometrical relations, in space of more than three 
dimensions, are not easily apprehended. 

We think of an $N$-dimensional sample space, in which an actual sample is 
represented by a point of coordinates $(x_1, x_2, \ldots, x_N)$. The probability of a 
sample with values of the variates respectively in the ranges $x_1 \pm \frac{1}{2}dx_1, \ldots, 
x_N \pm \frac{1}{2}dx_N$, is $f(x_1, x_2, \ldots, x_N)\,dv$, where $dv\,(= dx_1\,dx_2 \cdots dx_N)$ is an element 
of volume of the sample space located at the point $(x_1, \ldots, x_N)$. The 
density of the probability distribution at this point is $f(x_1, x_2, \ldots, x_N)$. 
It is convenient to take a new origin at the point $(\nu_1, \nu_1, \ldots, \nu_1)$, and axes 
$Ou_1, Ou_2, \ldots, Ou_N$ through this point. Therefore, we may represent the 
sample by the point $P(u_1, u_2, \ldots, u_N)$ where $u_i = x_i - \nu_1$. Although it is 
impossible to visualize a space of $N$ dimensions for $N > 3$, we will carry 
through the argument for the general case by analogy with the case of $N = 3$. 
So we consider the latter case first. 

When $N = 3$, the sample is represented by the point $P(u_1, u_2, u_3)$ and we 
have the mean $\bar{u}$ and variance $s^2$ defined by 

(a) $u_1 + u_2 + u_3 = 3\bar{u}$ 

and 

(b) $(u_1 - \bar{u})^2 + (u_2 - \bar{u})^2 + (u_3 - \bar{u})^2 = 3s^2$ 

For an assigned $\bar{u}$, (a) represents a plane; and, for an assigned pair of values 
of $(\bar{u}, s)$, (b) represents a sphere with center at the point $M(\bar{u}, \bar{u}, \bar{u})$. The line 

(c) $u_1 = u_2 = u_3$ 

has direction cosines each equal to $1/(3)^{1/2}$ and is normal to the plane (a). 
The perpendicular distance of $P$ from this line is 

$MP = s(3)^{1/2}$ 

as can be seen from (b). We require the probability, to within infinitesimals 
of order higher than $d\bar{u}\,ds$, of getting a sample of $N = 3$ independent values 
of $u$ which will simultaneously yield values of $\bar{u}$ and $s$ which lie within the 
region bounded by $\bar{u}$, $\bar{u} + d\bar{u}$ and $s$, $s + ds$. If we assume the normal law, 
an element of the joint distribution function is given by 

$dF = (2\pi\sigma^2)^{-3/2}e^{-(u_1^2 + u_2^2 + u_3^2)/2\sigma^2}\,dv$ 

where 

$dv = du_1\,du_2\,du_3$ 

As the sample point $P(u_1, u_2, u_3)$ varies, $\bar{u}$ and $s$ also vary. Corresponding to 
different values of $s$ we have a set of concentric spheres defined by (b). Since 
the plane (a) passes through the common center of the spheres, the region $dv$ 
is a shell between concentric spheres of radii $\sqrt{3}\,s$ and $\sqrt{3}(s + ds)$ and parallel 
planes corresponding to $\bar{u}$ and $\bar{u} + d\bar{u}$. These are the planes (in the figure) 
at a distance apart $d(OM)$. To use a homely illustration, $dv$ corresponds to 
one of the successive layers in a thin slice of an onion. Our problem is to 
express $dv$ in terms of $\bar{u}$, $d\bar{u}$, and $ds$. Now the line (c) meets the plane (a) at 
$M$ and the distance $OM$ is 

$OM = \bar{u}(3)^{1/2}$ 

so we have the differential element 

$d(OM) = (3)^{1/2}\,d\bar{u}$ 

Since the plane (a) passes through $M$, the intersection of the plane and sphere 
is a great circle with center at $M$ and radius equal to $s(3)^{1/2}$. The area of this 
circle is 

$A = 3\pi s^2$ 

and the differential element $dA$ is 

$dA = 6\pi s\,ds$ 

Therefore, within infinitesimals of higher order, 

$dv = dA\,d(OM) = C_1\,s\,ds\,d\bar{u}$ 

where here and hereafter, in this section, the $C$'s are constants. Hence, the 
required probability is 

$dF = C_2\,s\,e^{-3(\bar{u}^2 + s^2)/2\sigma^2}\,ds\,d\bar{u}$ 
Passing now to the general case involving N-space, let $P$ be the point
representing the sample $(u_1, \ldots, u_N)$. Then $PM$ is the perpendicular from $P$
upon the line

(d)  $u_1 = u_2 = \cdots = u_N$

and we have

$OM = \bar{u}\sqrt{N}, \qquad OP^2 = \sum u_i^2$

$MP^2 = OP^2 - OM^2 = \sum u_i^2 - N\bar{u}^2 = Ns^2$

In N-space, the plane (a) generalizes into the hyperplane

(e)  $\sum u_i = N\bar{u}$

and the sphere (b) generalizes into the hypersphere

(f)  $\sum (u_i - \bar{u})^2 = Ns^2$

with radius $MP = s\sqrt{N}$ and center at $(\bar{u}, \bar{u}, \ldots, \bar{u})$. Now, the hyperplane
(e) will intersect the hypersphere (f) in an (N - 1)-dimensional hypersphere
to correspond to the circle for the case N = 3. Consequently, for a given pair
of values of $\bar{u}$ and $s$, the point $P$ will lie on an (N - 1)-dimensional
hypersphere orthogonal to the line $OM$. The volume of this (N - 1)-hypersphere
is given by

$A = C\,s^{N-1}$

and so

$dA = C_1 s^{N-2}\,ds$

Therefore, the volume $dv = du_1\,du_2 \cdots du_N$ between two concentric spheres
of radius $\sqrt{N}\,s$ and $\sqrt{N}\,(s + ds)$ and two hyperplanes corresponding to $\bar{u}$ and
$\bar{u} + d\bar{u}$ is approximately

$dv = dA\,d(OM) = C_2\,s^{N-2}\,ds\,d\bar{u}$

Since $du_i = dx_i$ and $d\bar{u} = d\bar{x}$, we arrive at (7.33). It has been proved by
Geary that a necessary and sufficient condition that $\bar{x}$ and $s^2$ from samples of
N values of x be independent in the probability sense is that the x's be normally
distributed in the parent universe.

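In § 7.2, the mean of the sampling distribution of $s^2$ from an arbitrary
universe was obtained. It is interesting to verify that result in the present case
where the universe is specialized. The mean of the distribution of variances
of samples of N from a normal universe is given by

$E(s^2) = \int_0^\infty f_3(s^2)\,s^2\,d(s^2)$

where $f_3(s^2)$ is given by (7.36) and (7.37). Consequently, we have

$E(s^2) = C_3 \int_0^\infty (s^2)^{(N-1)/2} e^{-Ns^2/2\sigma^2}\,d(s^2) = C_3\,\Gamma\!\left(\frac{N+1}{2}\right)\left(\frac{2\sigma^2}{N}\right)^{(N+1)/2} = \frac{N-1}{N}\,\sigma^2 = \left(1 - \frac{1}{N}\right)\sigma^2$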

7.8 The Distribution of the Standard Deviation. The frequency function
of the standard deviations of samples of N from a normal universe is, from
(7.36),

(7.39)  $f_4(s) = 2C_3\,s^{N-2}\,e^{-Ns^2/2\sigma^2}$

so its mean value is given by

(7.40)  $E(s) = \int_0^\infty f_4(s)\,s\,ds = 2C_3\int_0^\infty s^{N-1} e^{-Ns^2/2\sigma^2}\,ds = C_3\,\Gamma\!\left(\frac{N}{2}\right)\left(\frac{2\sigma^2}{N}\right)^{N/2}$

Upon substituting the value of $C_3$, given in (7.37), we get

$E(s) = \left(\frac{2}{N}\right)^{1/2} \frac{\Gamma(N/2)}{\Gamma[(N-1)/2]}\,\sigma$

If we denote this coefficient of $\sigma$ by $b(N)$ we have

$E(s) = b(N)\,\sigma$

Romanovsky showed that asymptotically $b(N) = 1 - 3/4N - 7/32N^2 - \cdots$.
Table 10 gives values of the reciprocal of $b(N)$
for a few values of N. An unbiased estimate
of $\sigma$ is $s/b(N)$.

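The rth moment of the distribution of s is
given by

(7.41)  $E(s^r) = 2C_3\int_0^\infty s^{N-2+r} e^{-Ns^2/2\sigma^2}\,ds = \left(\frac{2\sigma^2}{N}\right)^{r/2} \frac{\Gamma\!\left(\frac{N-1+r}{2}\right)}{\Gamma\!\left(\frac{N-1}{2}\right)}$

Hence the variance of s is given by

(7.42)  $\mathrm{Var}(s) = E(s^2) - [E(s)]^2 = \left[\frac{N-1}{N} - b^2(N)\right]\sigma^2 = k(N)\,\sigma^2$

where

$k(N) = \frac{1}{2N} - \frac{1}{8N^2} - \frac{3}{16N^3} - \cdots$

The approximate value

(7.43)  $\sigma_s = \sigma/(2N)^{1/2}$

is frequently used in practice and this is the basis for the common statement
that the standard error of a standard deviation is $1/\sqrt{2}$ times that of a mean.

Table 10

    N      1/b(N)
    2      1.772
    3      1.382
    4      1.253
    5      1.189
    6      1.151
    7      1.126
    8      1.108
    9      1.094
   10      1.084
   20      1.040
   30      1.026
   50      1.015
  100      1.008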
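The entries of Table 10 can be checked directly from the Gamma-function form of $b(N)$. The following is a minimal sketch in Python, not part of the original text; the function names `b` and `b_series` are illustrative, and the series is the two-term form quoted from Romanovsky above.

```python
import math


def b(N: int) -> float:
    """Exact coefficient b(N) = sqrt(2/N) * Gamma(N/2) / Gamma((N-1)/2)."""
    # lgamma avoids overflow of the Gamma function for large N.
    return math.sqrt(2.0 / N) * math.exp(math.lgamma(N / 2) - math.lgamma((N - 1) / 2))


def b_series(N: int) -> float:
    """Two-term asymptotic form 1 - 3/(4N) - 7/(32 N^2)."""
    return 1.0 - 3.0 / (4 * N) - 7.0 / (32 * N * N)


if __name__ == "__main__":
    # Reciprocals, to be compared with the 1/b(N) column of Table 10.
    for N in (2, 3, 4, 5, 10, 20, 30, 50, 100):
        print(f"N={N:3d}  1/b(N)={1 / b(N):.3f}  series={1 / b_series(N):.3f}")
```

For small N the two columns differ noticeably, which is why the exact form is tabulated; for N of 20 or more the series is adequate to three decimals.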

7.9 The "Best" Estimate of the Population Standard Deviation. We have
seen that an unbiased estimate of $\sigma$ for a normal parent population is given by

$\frac{s}{b(N)} = s\left(1 + \frac{3}{4N} + \frac{25}{32N^2} + \cdots\right)$

There are, however, several other estimates which have desirable properties
and which differ somewhat from the unbiased estimate for small samples.
Thus, the modal value of s, say $\check{s}$, is found by differentiating $f_4(s)$ and putting
the derivative equal to zero. We find

$(N - 2)/s - Ns/\sigma^2 = 0$

so that

(7.44)  $\check{s} = \sigma\{(N - 2)/N\}^{1/2}$

The modal estimate of $\sigma$ may, therefore, be taken as

$\frac{sN^{1/2}}{(N - 2)^{1/2}} = s\left(1 + \frac{1}{N} + \frac{3}{2N^2} + \cdots\right)$

It is that value of $\sigma$ which would give, with maximum frequency in all
possible samples, that sample value of s which is actually observed. In practice
the observed values of the variate will be rounded off, so that the estimate of
$\sigma$ is subject to a small uncertainty.

Again, we may enquire what value of $\sigma$, say $\tilde{\sigma}$, would make $f_4(s)$ a maximum
for a given value of s. Putting

$\frac{d}{d\sigma}\left[\sigma^{-(N-1)} e^{-Ns^2/2\sigma^2}\right]_{\sigma = \tilde{\sigma}} = 0$

the other factors in $f_4(s)$ being independent of $\sigma$, we have

$-\frac{N - 1}{\tilde{\sigma}} + \frac{Ns^2}{\tilde{\sigma}^3} = 0$

or

(7.45)  $\tilde{\sigma} = s\left\{\frac{N}{N-1}\right\}^{1/2} = s\left(1 + \frac{1}{2N} + \frac{3}{8N^2} + \cdots\right)$

This may be called the maximum probability estimate of $\sigma$. It is that value
of $\sigma$, among all possible population values, for which the observed value of s is
also the most probable value.

At least two other estimates have certain claims to be called the "best."
The least squares estimate may be defined as that for which the expectation of
the square of the difference from the true population value is a minimum.
That is, $E\{(\hat{\sigma} - \sigma)^2\}$ is a minimum.

If we suppose that $\hat{\sigma} = as$, where a is a function of N to be determined, we
find on equating to zero the derivative of $E(as - \sigma)^2$ with respect to a that
$a = \sigma E(s)/E(s^2) = Nb(N)/(N - 1)$. Therefore

(7.46)  $\hat{\sigma} = \frac{sNb(N)}{N - 1} = s\left[1 + \frac{1}{4N} + \frac{1}{32N^2} + \cdots\right]$

The maximum likelihood estimate is that value of $\sigma$ for which the whole
sample actually found is the most probable sample. The probability of a
given sample is proportional to $\sigma^{-N}\exp\{-\sum(x_i - \mu)^2/2\sigma^2\}$. The logarithm
of the probability is

(7.47)  $L = -N\log\sigma - \sum(x_i - \mu)^2/2\sigma^2$

There is no value of $\sigma$ for which L is a maximum regardless of the value of $\mu$,
but if we want simultaneous maximum likelihood estimates of $\sigma$ and $\mu$ we may
put $\partial L/\partial\sigma = 0$ and $\partial L/\partial\mu = 0$, and so obtain estimates $\hat{\sigma}$ and $\hat{\mu}$ given by

$-N/\hat{\sigma} + \sum(x_i - \hat{\mu})^2/\hat{\sigma}^3 = 0, \qquad \sum(x_i - \hat{\mu}) = 0$

Hence

(7.48)  $\hat{\mu} = \sum x_i/N = \bar{x}$

(7.49)  $\hat{\sigma}^2 = \sum(x_i - \bar{x})^2/N = s^2$

The maximum likelihood estimates of $\mu$ and $\sigma$ are, therefore, $\bar{x}$ and $s$
respectively.

We see, therefore, that there is no single best estimate. What is the best
estimate from one point of view may not be the best from another. By
convention, however, the estimate usually selected is the maximum probability
one, $\tilde{\sigma} = sN^{1/2}/(N - 1)^{1/2}$. It is not unbiased, but it is the square root of an
unbiased estimate of $\sigma^2$.
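To see how far the competing estimates of § 7.9 actually differ for a small sample, the factor multiplying s under each definition can be tabulated. This is a hedged sketch only: the function name `sigma_multipliers` and the dictionary labels are illustrative choices, and s is taken with divisor N as in the text.

```python
import math


def b(N: int) -> float:
    """Coefficient b(N) with E(s) = b(N) * sigma for a normal universe."""
    return math.sqrt(2.0 / N) * math.exp(math.lgamma(N / 2) - math.lgamma((N - 1) / 2))


def sigma_multipliers(N: int) -> dict:
    """Factor by which s is multiplied under each estimate of sigma (requires N > 2)."""
    return {
        "modal": math.sqrt(N / (N - 2)),                # from (7.44)
        "unbiased": 1.0 / b(N),                         # s / b(N)
        "maximum probability": math.sqrt(N / (N - 1)),  # (7.45)
        "least squares": N * b(N) / (N - 1),            # (7.46)
        "maximum likelihood": 1.0,                      # (7.49)
    }


if __name__ == "__main__":
    for name, factor in sigma_multipliers(10).items():
        print(f"{name:20s} {factor:.4f}")
```

The ordering of the factors follows the leading terms of the series given above (1/N, 3/4N, 1/2N, 1/4N, 0), so all five estimates agree as N increases.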

7.10 The $(\bar{x}, s)$-Frequency Surface. We may regard $f_2(\bar{x}, s)$ in (7.33) as
describing a frequency surface if the volume under the surface represents the
expected relative frequency of the means and standard deviations of all possible
samples of size N. In depicting this surface it is convenient to let $u = \bar{x} - \mu$
so that the origin of u is at $\bar{x} = \mu$.

Since

$\iint f(\bar{x}, s)\,d\bar{x}\,ds = 1$

then the volume under the surface over a closed contour in the us-plane
represents the proportion or percentage of samples whose means and standard
deviations fall simultaneously within the ranges defined by the boundary
of the given contour. In an illuminating paper by Deming and Birge two
such frequency surfaces are represented. These are reproduced in Figure 20,
one for a small value of N and the other for a comparatively large value of N.

Fig. 20. The Surface $f(\bar{x}, s)$ Illustrated by Sections

As the authors point out, the highest point of the surface has the coordinates
$u = 0$, $s = \sigma\{(N - 2)/N\}^{1/2}$. Because of the independence of $\bar{x}$ and $s$, all
plane sections s = constant will be normal curves with standard deviations
equal to $\sigma/N^{1/2}$. The u = constant sections will be skew curves whose
equations are given by $f_4(s)$. They will all have the same mean and mode.
As N increases, their mean and mode approach coincidence with the value $\sigma$
while the curves lose their skewness and become normal with center at $s = \sigma$
and standard deviations equal to $\sigma/(2N)^{1/2}$. As N increases, the surface
becomes more and more concentrated about the point $u = 0$, $s = \sigma$.

7.11 "Student's" z-Distribution. The formula used in testing a null
hypothesis that a given sample comes from a universe with a proposed mean is

(7.50)  $\frac{(\bar{x} - \mu)N^{1/2}}{\sigma}$

As stated in Chapter VI, (7.50) is normally distributed if the universe is
normal. In practice, $\sigma$ is seldom available and usually must be estimated from
the data available. If we substitute into (7.50) the estimate of $\sigma$ given in
(7.45) and calculate

(7.51)  $t = \frac{(\bar{x} - \mu)(N - 1)^{1/2}}{s}$

we are not justified in asserting that t is normally distributed unless N is large.
And so, in testing the significance of the mean of a small sample we are not
justified in referring t to a normal probability scale. The variability of s from
sample to sample invalidates that procedure.

While Helmert obtained the distribution of $s^2$ as early as 1876 it seems that
"Student" was the first to recognize the importance, for the theory of small
samples, of taking account of the variability of s in (7.51). "Student"
actually found the distribution of a slightly different variable, namely

(7.52)  $z = \frac{\bar{x} - \mu}{s}$

Obviously, z is functionally related to t by

(7.53)  $z = t(N - 1)^{-1/2}$

so the distribution of t can easily be obtained from that of z. From (7.52) we
obtain $d\bar{x} = s\,dz$ for a fixed value of s. Substituting in (7.33), for which $\mu$ is
supposed zero, we obtain, for the joint distribution of s and z,

(7.54)  $k\,s^{N-1} e^{-Ns^2(1 + z^2)/2\sigma^2}\,ds\,dz$

This expression is defined for s > 0, since s is taken as the positive square root
of $s^2$. If s is integrated out of (7.54), the distribution of the single variable z is
obtained. To perform this integration, let

$y = s(1 + z^2)^{1/2}, \qquad ds = (1 + z^2)^{-1/2}\,dy$

Integrating with respect to y from 0 to $\infty$, we have

$k\left\{\int_0^\infty y^{N-1} e^{-Ny^2/2\sigma^2}\,dy\right\}(1 + z^2)^{-N/2}\,dz$

which reduces to

$K(1 + z^2)^{-N/2}\,dz$

where, as shown in Example 3, § 3.5,

(7.55)  $K = \frac{\Gamma(N/2)}{\pi^{1/2}\,\Gamma[(N-1)/2]}$

Therefore, the frequency function for "Student's" z is

(7.56)  $f(z) = \frac{\Gamma(N/2)}{\pi^{1/2}\,\Gamma[(N-1)/2]}\,(1 + z^2)^{-N/2}$

The curve is symmetrical with mean zero and infinite range. It is quite
different, however, in mathematical character from the normal curve although
it approaches this form as $N \to \infty$. From the viewpoint of sampling theory
the important property of (7.56) is its independence of $\sigma$. The revolutionary
character of this property is revealed in certain applications that involve
drawing probable inferences from small samples, say from a sample of N = 10.

7.12 "Student's" t-Distribution. Substituting (7.53) into (7.56) and
replacing N - 1 by n we obtain

(7.57)  $f_n(t) = K_n\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}$

where $1/K_n = n^{1/2}B(n/2, 1/2)$, B being the Beta function.

Inasmuch as (7.57) is independent of $\sigma$, it can be used in situations in which
the value of $\sigma$ is unknown. The quantity t involves no hypothetical quantities
except $\mu$, being otherwise completely expressible in terms of the observations.
In 1925, "Student" published in Metron an extensive table of the probability
integral $\int_{-\infty}^{t} f_n(t)\,dt$. More recently, Fisher has given a short table of the
probability P of occurrence of deviations outside $\pm t$ for values of t and n
commonly met in applications of small sample theory. Let

$P_n(t) = \int_{-t}^{t} f_n(t)\,dt$

Then the probability P tabulated by Fisher is

$P = 1 - P_n(t)$

Several t-tables are available.* The one in Statistical Tables by Fisher and
Yates shows for n between 1 and 120 the values of t for which P takes the
values given at the head of the columns. The number n, with which to enter
the table, is determined by the number of degrees of freedom involved in the
available estimate of $\sigma$. In testing a null hypothesis H, if a computed value
of (7.51) is larger than the tabular value for the level of P selected, then H is
rejected at that level of significance. Or, if one prefers, he may note the
tabular value of P for the computed value of (7.51) and use the rule in § 6.7, where
now, of course, $P_s$ is to be replaced by P.

The distribution of t (as well as that of z) approaches the normal form as
$n \to \infty$. This may be established as follows. Using Stirling's approximation
on the coefficient $K_n$ in (7.57) we obtain, after some algebraic simplification,
the following expression:

$K_n = e^{-1/2}(n\pi)^{-1/2}\left(\frac{n-1}{n-2}\right)^{(n-2)/2}\left(\frac{n-1}{n-2}\right)^{1/2}\left(\frac{n-1}{2}\right)^{1/2}$

From this it is easy to show that

$\lim_{n \to \infty} K_n = (2\pi)^{-1/2}$

The rest of the t function may be written as

$\left\{\left(1 + \frac{t^2}{n}\right)^{n}\right\}^{-1/2}\left(1 + \frac{t^2}{n}\right)^{-1/2}$

which, when $n = \infty$, becomes $e^{-t^2/2}$. Therefore,

$\lim_{n \to \infty} f_n(t) = (2\pi)^{-1/2}e^{-t^2/2}$

The entries in the last line of Fisher's table, corresponding to $n = \infty$, are the
deviations from the mean of a normal curve with unit standard deviation.
The variance of t is given by

$\int_{-\infty}^{\infty} f_n(t)\,t^2\,dt = n/(n - 2)$

Hence $t[(n - 2)/n]^{1/2}$ is a standard variate, and, as shown above, is
approximately normally distributed when n is large. In applications, therefore, it
is frequently satisfactory to refer

(7.58)  $\frac{(\bar{x} - \mu)(N - 3)^{1/2}}{s}$

to a normal probability scale when N > 30.

* Table IV in our Appendix is an abridged version giving P for n = 1 to 30. Obviously,
$\tfrac{1}{2}P = \int_t^\infty f_n(t)\,dt = 1 - F_n(t)$ where $F_n(t)$ is the distribution function of t. Another table is
given by M. Merrington in Biometrika, 32, 1941-42, p. 300. This gives values of t for values
of P (0.005, 0.025, 0.25) not included in Statistical Tables.
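The limit of $K_n$ stated above is easy to confirm numerically. The sketch below, which is an illustration and not part of the original derivation, evaluates $K_n = \Gamma[(n+1)/2]/\{(n\pi)^{1/2}\Gamma(n/2)\}$ (equivalent to $1/K_n = n^{1/2}B(n/2, 1/2)$) for increasing n; the function name `K` is an arbitrary choice.

```python
import math


def K(n: float) -> float:
    """Coefficient K_n of the t frequency function (7.57)."""
    # Ratio of Gamma functions computed through logarithms for stability.
    return math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2)) / math.sqrt(n * math.pi)


limit = 1.0 / math.sqrt(2.0 * math.pi)  # ordinate of the unit normal curve at zero

if __name__ == "__main__":
    for n in (1, 5, 30, 1000):
        print(f"n={n:5d}  K_n={K(n):.5f}  (2*pi)^(-1/2)={limit:.5f}")
```

The central ordinate rises steadily toward the normal value, which is one way of seeing that the t curve has heavier tails than the normal curve for small n.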


Example 1. For a random sample of 10 the mean is 12.1 and the standard deviation 3.2.
Is it reasonable to suppose that this sample came from a normal parent population with
mean 10?

Solution. Here

$t = \frac{(12.1 - 10)\sqrt{9}}{3.2} = 1.97$

with 9 degrees of freedom. The probability of getting a value of t numerically as great as
this, on the assumed hypothesis, is about 0.08.

The probability of getting a value as great as 1.97 is one half of this, 0.04. Hence although
the hypothesis is not definitely unreasonable, it is rather doubtful if the sample could have
come from a population with a mean as low as 10.

It may be noted that, if t were a standard normal variate, the probability of obtaining a
value as large as 1.97 would be only 0.024.
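The figures in Example 1 can be reproduced without a table by integrating (7.57) numerically. The following is a minimal sketch under stated assumptions: Simpson's rule with 2000 steps stands in for the tabulated probability integral, and the helper names `t_density` and `central_area` are illustrative.

```python
import math


def t_density(t: float, n: int) -> float:
    """f_n(t) = K_n (1 + t^2/n)^(-(n+1)/2), with 1/K_n = sqrt(n) B(n/2, 1/2)."""
    log_k = math.lgamma((n + 1) / 2) - math.lgamma(n / 2) - 0.5 * math.log(n * math.pi)
    return math.exp(log_k - 0.5 * (n + 1) * math.log1p(t * t / n))


def central_area(t: float, n: int, steps: int = 2000) -> float:
    """P_n(t): area under f_n between -t and +t by Simpson's rule (steps must be even)."""
    h = t / steps
    total = t_density(0.0, n) + t_density(t, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    return 2.0 * total * h / 3.0  # factor 2: the curve is symmetrical about zero


# Data of Example 1: N = 10, mean 12.1, s = 3.2, hypothetical mean 10.
N, mean, s, mu0 = 10, 12.1, 3.2, 10.0
t = (mean - mu0) * math.sqrt(N - 1) / s  # formula (7.51)
P = 1.0 - central_area(t, N - 1)         # probability of a numerically larger t
print(f"t = {t:.2f}, P = {P:.3f}, one tail = {P / 2:.3f}")
```

The same two functions serve for any n, so the comparison with the normal tail quoted at the end of the example can be repeated for other sample sizes.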


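The distribution of the "Student" t statistic is also obtainable quite readily
from the remark in § 7.6 that $Ns^2/\sigma^2$ is distributed as $\chi^2$ with N - 1 degrees
of freedom. For, since

$t = (N - 1)^{1/2}(\bar{x} - \mu)/s$

we have

$\frac{t^2}{N - 1} = \frac{(\bar{x} - \mu)^2}{s^2} = \frac{N(\bar{x} - \mu)^2/\sigma^2}{Ns^2/\sigma^2}$

which is the quotient of the square of a standard normal variate and a $\chi^2$
variate. Now, by Theorems 5.2 and 5.5 the numerator is a $\gamma(\tfrac{1}{2})$ variate and
the denominator is a $\gamma[(N - 1)/2]$ variate. Hence by Theorem 5.4, the
quotient is a $\beta'[\tfrac{1}{2}, (N - 1)/2]$ variate.

The frequency function of $t^2$ is therefore given by

(7.59)  $f(t^2)\,d(t^2) = \frac{1}{n\,B(\tfrac{1}{2}, n/2)}\left(\frac{t^2}{n}\right)^{-1/2}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} d(t^2)$

where n = N - 1.

Hence the frequency function of t is

(7.60)  $f(t)\,dt = \frac{1}{n^{1/2}\,B(\tfrac{1}{2}, n/2)}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} dt$

The factor 2 does not appear, because t goes from $-\infty$ to $\infty$, whereas $t^2$ goes
from 0 to $\infty$, and the constant is so adjusted that

$\int_{-\infty}^{\infty} f(t)\,dt = 1$

From the above argument we obtain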

Theorem 7.2. Any statistic t has the "Student" t-distribution for n degrees
of freedom if $t^2/n$ is the ratio of two independent variates distributed respectively
as $\chi^2$ with 1 and n degrees of freedom.

7.13 Difference Between Two Means. Fisher demonstrated that (7.57)
has a much wider range of application than the problem for which it was
designed. He showed that the t-distribution is applicable whenever we are
dealing with a normally distributed variate whose standard deviation is not
known exactly but is independently estimated from observations amounting
to n degrees of freedom. The scheme by which the "Student" idea is made
available to other problems consists in constructing a variable t in the nature
of a fraction whose numerator is any statistic normally distributed and whose
denominator is the square root of an independently distributed and unbiased
estimate of the variance of the numerator involving n degrees of freedom.*
Thus the t-distribution has been found useful in such problems as testing the
significance of the difference between two means and testing hypotheses
regarding regression coefficients.

Let $\bar{x}_1$, $\bar{x}_2$ be the means and $s_1$, $s_2$ the standard deviations of two independent
samples of $N_1$ and $N_2$ variates, respectively, from a normal universe with
mean $\mu$ and variance $\sigma^2$. According to Theorem 4.11 the variance of the
difference between the two means is $\sigma^2(N_1 + N_2)/N_1N_2$. Then it follows
that the variable

$\frac{\bar{x}_1 - \bar{x}_2}{\sigma}\left\{\frac{N_1N_2}{N_1 + N_2}\right\}^{1/2}$

is normally distributed with unit standard deviation. However, in most
practical problems $\sigma$ is unavailable and must be estimated from the samples.
Using the unbiased estimate of $\sigma^2$ defined in (7.10), the above formula becomes

(7.61)  $t = \frac{\bar{x}_1 - \bar{x}_2}{\left\{(N_1s_1^2 + N_2s_2^2)/(N_1 + N_2 - 2)\right\}^{1/2}}\left\{\frac{N_1N_2}{N_1 + N_2}\right\}^{1/2}$

Fisher showed that (7.61) is distributed in accord with (7.57) for
$n = N_1 + N_2 - 2$, and we can find from Fisher's table of P the probability
of a greater difference between the means than that observed.

As $N_1$ and $N_2$ become large, $(N_1 + N_2)/(N_1 + N_2 - 2)$ tends toward
unity and (7.61) tends toward the value

(7.62)  $\frac{\bar{x}_1 - \bar{x}_2}{\left\{s_1^2/N_2 + s_2^2/N_1\right\}^{1/2}}$

Since (7.61) is asymptotically normally distributed, the older procedure of
referring (7.62) to a normal probability scale in testing a null hypothesis that
two samples are from the same universe would not be invalid to any
appreciable extent for large values of $N_1$ and $N_2$. Kenney has called attention
to a formula which is commonly used in place of (7.62). This formula has
$\{s_1^2/N_1 + s_2^2/N_2\}^{1/2}$ in the denominator. It is approximately valid if we
have reason to believe that the two samples come from populations with
different $\sigma_1$ and $\sigma_2$. (See § 9.8.)

* A statistic so treated is often said to be "studentized."

If one of the samples, say $N_2$, is so much larger than the other that it tends
toward the universe, then $\bar{x}_2$ tends toward $\mu$ and $s_2$ tends toward $\sigma$. So, under
these conditions, (7.62) tends toward

$\frac{(\bar{x}_1 - \mu)N_1^{1/2}}{\sigma}$

which, if the subscripts are dropped, is the formula used in testing a null
hypothesis that a given sample comes from a normal universe with a proposed
mean. When $N_1 = N_2 = N$, (7.61) reduces to

(7.63)  $t = (\bar{x}_1 - \bar{x}_2)\left\{\frac{N - 1}{s_1^2 + s_2^2}\right\}^{1/2}, \qquad n = 2(N - 1)$

Inasmuch as we do not ordinarily know whether a sample is drawn from a
normal universe or some other type of universe, a question quite naturally
arises as to whether the procedure inaugurated by "Student" and extended
by Fisher is applicable to small samples from non-normal universes. The
question may be considered partially answered by Bartlett and others who
have shown that it gives a good approximation for considerable departures
from normality in the sampled universe. However, a word of caution seems
to be in order lest the procedure be oversold in the applications by completely
neglecting the underlying assumptions of normality in the universe and
randomness of the samples.

The following example, cited by Rietz, illustrates the "Student" theory.

Example 2. The following data represent the yields in bushels of Indian corn on ten
subdivisions of equal areas of two agricultural plots in which Plot 1 was a control plot
treated the same as Plot 2, except for the amount of phosphorus applied as a fertilizer.

          Plot 1      Plot 2
           6.2         5.6
           5.7         5.9
           6.5         5.6
           6.0         5.7
           6.3         5.8
           5.8         5.7
           5.7         6.0
           6.0         5.5
           6.0         5.7
           5.8         5.5
  Sum     60.0        57.0
  Mean    $\bar{x}_1$ = 6.0    $\bar{x}_2$ = 5.7

Is there a significant difference between the yields on the two plots, using the difference
between their means as a criterion of judgment?

Solution.

$s_1^2 = \frac{0.64}{10} = .064, \qquad s_2^2 = \frac{0.24}{10} = .024$

Substitution in (7.63) gives

$t = (6.0 - 5.7)\left\{\frac{9}{.088}\right\}^{1/2} = (.3)(10.113) = 3.034$

Entering "Student's" tables in Metron (loc. cit.) at n = 18, we find P = .0072 for the
probability that t will fall outside the range -3.034 and +3.034. Hence a null hypothesis
that the samples are from the same universe would be refuted by the test for both the .05
and .01 levels of significance. In other words, our conclusion is that, on the levels of
significance adopted, there is a significant difference between the yields on the plots.
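The arithmetic of Example 2 can be carried out directly from the raw yields with formula (7.61), of which (7.63) is the special case $N_1 = N_2$. The following is a minimal sketch; the function names are illustrative, and the variance uses the divisor N, following the definition of $s^2$ used in this chapter.

```python
import math

plot1 = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8, 5.7, 6.0, 6.0, 5.8]
plot2 = [5.6, 5.9, 5.6, 5.7, 5.8, 5.7, 6.0, 5.5, 5.7, 5.5]


def mean(xs):
    return sum(xs) / len(xs)


def variance_divisor_n(xs):
    """s^2 with divisor N (not N - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def two_sample_t(a, b):
    """Return (t, n) from (7.61): pooled variance estimate with n = N1 + N2 - 2."""
    n1, n2 = len(a), len(b)
    pooled = (n1 * variance_divisor_n(a) + n2 * variance_divisor_n(b)) / (n1 + n2 - 2)
    t_value = (mean(a) - mean(b)) / math.sqrt(pooled) * math.sqrt(n1 * n2 / (n1 + n2))
    return t_value, n1 + n2 - 2


t, n = two_sample_t(plot1, plot2)
print(f"t = {t:.3f} with n = {n} degrees of freedom")
```

The value of P would then be read from a t-table at n = 18, as in the text.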


7.14 Fisher's z-Distribution. Suppose $u^2$ and $v^2$ are two independent and
unbiased estimates of the variance of a variable x which is normally
distributed. If these estimates are based upon samples of $N_1$ and $N_2$,
respectively, or upon $n_1$ and $n_2$ degrees of freedom, then we have

$u^2 = \frac{1}{n_1}\sum_{i=1}^{N_1}(x_i - \bar{x}_1)^2 = \frac{n_1 + 1}{n_1}\,s_1^2, \qquad v^2 = \frac{1}{n_2}\sum_{i=1}^{N_2}(x_i - \bar{x}_2)^2 = \frac{n_2 + 1}{n_2}\,s_2^2$

in which $\bar{x}_1$ and $\bar{x}_2$ are the means of the two samples. In previous notation
$u^2$ and $v^2$ would be denoted by $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$, but these symbols are too unwieldy
in the present discussion.

In constructing a test of significance for the difference between two sample
variances it might seem logical to form the difference $w = u^2 - v^2$ and seek
the frequency function of w. However, such a procedure is impractical
because of the mathematical difficulty involved in determining this function.
Fisher circumvented this difficulty by building a statistic, z, defined by

(7.64)  $z = \tfrac{1}{2}(\log_e u^2 - \log_e v^2) = \log_e \frac{u}{v}$

whose frequency function, $g(z)$, he obtained and which proved to have
extremely wide application. To derive $g(z)$ we make use of the distribution of
$s^2$ given in (7.36), replacing N - 1 by n and $s^2$ by $\{n/(n + 1)\}u^2$. After
this modification, (7.36) becomes

(7.65)  $f(u^2)\,d(u^2) = \frac{\{n/2\sigma^2\}^{n/2}}{\Gamma(n/2)}\,(u^2)^{(n-2)/2}\,e^{-nu^2/2\sigma^2}\,d(u^2)$

Since $u^2$ and $v^2$ are independent their joint distribution is

(7.66)  $K(u^2)^{(n_1-2)/2}(v^2)^{(n_2-2)/2}\,e^{-(n_1u^2 + n_2v^2)/2\sigma^2}\,d(u^2)\,d(v^2)$

where

$K = \frac{n_1^{n_1/2}\,n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\sigma^{(n_1+n_2)}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}$

From (7.64) we have

(7.67)  $u^2 = v^2e^{2z}$

and for a fixed value of $v^2$

(7.68)  $d(u^2) = 2v^2e^{2z}\,dz$

Using (7.67) and (7.68) in (7.66) we obtain

(7.69)  $2K\,e^{n_1z}\,(v^2)^{(n_1+n_2-2)/2}\,e^{-(n_1e^{2z} + n_2)v^2/2\sigma^2}\,d(v^2)\,dz$

for the joint distribution of $v^2$ and z. Integrating with respect to $v^2$ between
the limits 0 and $\infty$ and making use of the Gamma function we obtain the
distribution of z:

(7.70)  $g(z)\,dz = \frac{2\,n_1^{n_1/2}\,n_2^{n_2/2}}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)}\,\frac{e^{n_1z}}{(n_1e^{2z} + n_2)^{(n_1+n_2)/2}}\,dz$

The function $g(z)$ has the important property that it depends solely upon
$n_1$ and $n_2$, not at all upon the variance of the sampled universe. Fisher's z
should not be confused with the z-distribution of "Student."

The z-distribution is extremely general, including as special cases the
$\chi^2$-distribution, the t-distribution of "Student" and Fisher, and the normal
distribution. Rider has made easily available the transformations and substitutions
by which these special cases can be obtained from (7.70).

The positive part of the curve for $z = \log_e(u/v)$ is the same as the negative
part for $z = \log_e(v/u)$. Since it is optional which estimate is considered as
$u^2$, it is usual, in tabulating the probability integral of $g(z)$, to consider only
positive values of z by making $u^2$ the larger variance estimate (based on $n_1$
degrees of freedom).

Let $Q = \int_{-\infty}^{z_0} g(z)\,dz$ and let $P = 1 - Q$. Thus P is the probability that
$z > z_0$. In his book, Fisher has given values of $z_0$ corresponding to the
probabilities P = .05, .01, and .001, for various combinations of $n_1$ and $n_2$.
These values, $z_0$, are called the "5%, 1%, and 0.1% points" and are used as
critical values in judging significance. In practice, however, tables
constructed from the F-distribution (see §§ 7.15 and 7.16) instead of the
z-distribution are commonly used.

7.15 Significance of Difference Between Variances. The usual hypothesis
tested by the z-test is that $u^2$ and $v^2$ are estimates of one and the same
population variance and therefore that z = 0. The significance of the divergence of
the observed value of z from zero is the crux of the test. Small values of z
mean a tenable hypothesis whereas values of z larger than $z_0$ refute the
hypothesis. If for P = .05 (or .01) the observed value of z, as computed from
the samples in accordance with (7.64), is larger than $z_0$, the hypothesis is to
be rejected and the conclusion is that the samples come from universes with
different variances.

The z-test may well be applied before testing the difference between two
means since the latter test depends on the equality of the population variances.

To avoid the troublesome logarithmic computation involved in (7.64)
Snedecor has published tables which give 5% and 1% points for the ratio
$u^2/v^2$, where $e^{2z} = u^2/v^2$. Snedecor calls this ratio F in honor of Fisher.*
Therefore,

$F = u^2/v^2 = e^{2z}$

where $u^2$ is to be chosen the larger of the two given variance estimates. This
table is reproduced in the Appendix. (See Table II.)


Example 3. In Example 2 suppose we wish to test the assumption, which was made there,
that the two samples come from universes with equal variance. We have

$u^2 = \frac{n_1 + 1}{n_1}\,s_1^2 = \frac{.64}{9} = .0711$

$v^2 = \frac{n_2 + 1}{n_2}\,s_2^2 = \frac{.24}{9} = .0267$

$F = \frac{.0711}{.0267} = 2.663$

$z = .5\log_e F = 1.1513\log_{10} F = .49$

Entering Fisher's table (loc. cit.) for $n_1 = n_2 = 9$ we find $z_0 = .58$ for P = .05 and $z_0 = .84$
for P = .01. This means that, if the true value of z were zero, random sampling fluctuations
would be expected to give a value of z as great as .84, or greater, once in 100 trials, and a
value of z as great as .58, or greater, five times in 100 trials. The observed value of z is .49
and so this value might be accounted for by chance, at either the .05 or .01 points of
significance. Using Snedecor's table we find F = 3.18 for P = .05 and F = 5.35 for P = .01.
Since the observed value of F is only 2.663, the hypothesis that the variances are equal is
not rejected.
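The computation in Example 3 amounts to a few lines. In the sketch below the sums of squares .64 and .24 and the tabular points 3.18 and 5.35 are the figures quoted in Examples 2 and 3; the variable names are illustrative. Because the unrounded ratio is used, F comes out as 2.667 rather than the 2.663 obtained from the rounded estimates.

```python
import math

# Sums of squared deviations from Example 2 and the degrees of freedom.
sum_sq_1, sum_sq_2 = 0.64, 0.24
n1 = n2 = 9

u2 = sum_sq_1 / n1       # larger variance estimate
v2 = sum_sq_2 / n2       # smaller variance estimate
F = u2 / v2              # Snedecor's variance ratio
z = 0.5 * math.log(F)    # Fisher's z, formula (7.64)

# 5% and 1% points for n1 = n2 = 9, as quoted in the text from Snedecor's table.
F_05, F_01 = 3.18, 5.35
reject_at_05 = F > F_05
reject_at_01 = F > F_01

print(f"u^2 = {u2:.4f}, v^2 = {v2:.4f}, F = {F:.3f}, z = {z:.2f}")
print(f"reject at .05: {reject_at_05}, reject at .01: {reject_at_01}")
```

The tabular z points quoted in the example are recovered from the F points by the same relation, since half the natural logarithm of 3.18 and of 5.35 gives .58 and .84.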


* In Statistical Tables by Fisher and Yates it is called the variance ratio. The 3rd edition
(1948) gives tables for 0.1%, 1%, 5%, 10%, and 20% points. More extensive tables by
M. Merrington and C. M. Thompson are available in Biometrika, 33, 1943, pp. 73-88.

7.16 The Distribution of F. Since $F = (N_1s_1^2/n_1) \div (N_2s_2^2/n_2)$, it follows that
$n_1F/n_2$ is the ratio of two variates distributed as $\chi^2$ with $n_1$ and $n_2$ degrees of
freedom, respectively. It is therefore the ratio of a $\gamma(n_1/2)$ variate to a
$\gamma(n_2/2)$ variate, and so is a $\beta'(n_1/2, n_2/2)$ variate, with frequency function

(7.71)  $f(F) = K\left(\frac{n_1}{n_2}\right)\left(\frac{n_1F}{n_2}\right)^{(n_1-2)/2}\left(1 + \frac{n_1F}{n_2}\right)^{-(n_1+n_2)/2}, \qquad 0 < F < \infty$

where

(7.72)  $K = \frac{1}{B(n_1/2,\ n_2/2)}$

The expectation of $n_1F/n_2$ is, by (5.14), $(n_1/2)/[(n_2/2) - 1]$, so that the
expectation of F is $n_2/(n_2 - 2)$, provided, of course, that $n_2 > 2$. This is
always greater than unity and is independent of $n_1$. The modal value of F,
given by differentiating (7.71), is $[n_2/(n_2 + 2)][(n_1 - 2)/n_1]$, which is always
less than unity.
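The statements about the mean and mode of F can be checked against the frequency function itself. The following is a hedged sketch: it assumes the form of (7.71) with $K = 1/B(n_1/2, n_2/2)$, and locates the mode by a simple grid search rather than by differentiation; the function names are illustrative.

```python
import math


def f_density(F: float, n1: int, n2: int) -> float:
    """Frequency function of F for n1 and n2 degrees of freedom (F > 0)."""
    log_beta = math.lgamma(n1 / 2) + math.lgamma(n2 / 2) - math.lgamma((n1 + n2) / 2)
    r = n1 * F / n2
    log_f = (math.log(n1 / n2) + (n1 / 2 - 1) * math.log(r)
             - (n1 + n2) / 2 * math.log1p(r) - log_beta)
    return math.exp(log_f)


def mean_F(n2: int) -> float:
    """Expectation of F, valid for n2 > 2."""
    return n2 / (n2 - 2)


def mode_F(n1: int, n2: int) -> float:
    """Modal value of F, valid for n1 > 2."""
    return (n2 / (n2 + 2)) * ((n1 - 2) / n1)


n1, n2 = 9, 9
grid = [i / 1000 for i in range(1, 5001)]  # F from 0.001 to 5.000
numeric_mode = max(grid, key=lambda F: f_density(F, n1, n2))
print(f"mode (formula) = {mode_F(n1, n2):.4f}, mode (grid) = {numeric_mode:.3f}, "
      f"mean = {mean_F(n2):.4f}")
```

With $n_1 = n_2 = 9$, the degrees of freedom of Example 3, the mode lies below unity and the mean above it, as stated.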

When $n_1 = 1$, the distribution of F is the same as that of $t^2$ with $n_2$ degrees
of freedom.

When $n_2 = \infty$, $v^2 = \sigma^2$ so that $n_1F$ is distributed as $\chi^2$ with $n_1$ degrees of
freedom.

When $n_1 = \infty$, $n_2/F$ is distributed as $\chi^2$ with $n_2$ degrees of freedom. When
$n_2 = \infty$ and $n_1 = 1$, the distribution of $\sqrt{F}$ is normal.

The table of F given in the Appendix gives only the upper tail. For testing
values of F < 1, we may take the reciprocal of F and interchange the degrees
of freedom $n_1$ and $n_2$, although this is seldom necessary.

It should be observed that the 5% and 1% points of the F-table (and also
of the z-table), when used to test the equality of two estimates of variance,
refer to 10% and 2% levels of significance. The reason for this will be clear
on considering the hypothesis that is being tested.

Suppose we have two samples of sizes $n_1 + 1$ and $n_2 + 1$ from normal
populations with means $\mu_{(1)}$, $\mu_{(2)}$ and variances $\sigma_1^2$, $\sigma_2^2$, respectively. Let
$\theta = \sigma_1^2/\sigma_2^2$ and let the hypothesis $H_0$ be that $\theta = 1$, regardless of the actual
values of $\sigma_2^2$, $\mu_{(1)}$ and $\mu_{(2)}$. Let $A_{12}$ and $B_{12}$ be two numbers, depending on
$n_1$ and $n_2$, chosen so that for any given $n_1$ and $n_2$

$\int_{A_{12}}^{B_{12}} f(F)\,dF = 1 - \alpha, \qquad 0 < \alpha < 1$

If then we agree to reject $H_0$ when either $F < A_{12}$ or $F > B_{12}$, the
probability of rejecting $H_0$ when it is true will be $1 - \Pr\{A_{12} < F < B_{12}\} = \alpha$, so
that $\alpha$ is our level of significance.

Moreover, a confidence interval for $\theta$, corresponding to confidence coefficient
$1 - \alpha$, will be given by $F/B_{12} < \theta < F/A_{12}$, since the probability that the
true value of $\theta$ is covered by this interval when $H_0$ is true is

$\Pr\left\{\frac{F}{B_{12}} < \theta < \frac{F}{A_{12}} \,\Big|\, \theta = 1\right\} = \Pr\{A_{12} < F < B_{12}\} = 1 - \alpha$

Now $A_{12}$ and $B_{12}$ may be chosen in different ways. If we wish for a test
that is equally sensitive for $\theta < 1$ and $\theta > 1$, it will be natural to use both
tails of the F-distribution and to put

$\int_0^{A_{12}} f(F)\,dF = \int_{B_{12}}^{\infty} f(F)\,dF = \tfrac{1}{2}\alpha$

This implies that $A_{12} = 1/B_{21}$, where $B_{21}$ is the number obtained from $B_{12}$
by interchanging $n_1$ and $n_2$, since if we put $u = 1/F$, we obtain

$\int_0^{A_{12}} f(F)\,dF = \frac{(n_2/n_1)^{n_2/2}}{B(\tfrac{1}{2}n_1, \tfrac{1}{2}n_2)}\int_{1/A_{12}}^{\infty} u^{\frac{1}{2}n_2 - 1}\left(1 + \frac{n_2u}{n_1}\right)^{-(n_1+n_2)/2} du$

and the right-hand side is the F integral with $n_1$ and $n_2$ interchanged.

Because of the relation between $A_{12}$ and $B_{21}$, it does not matter which sample
we take as number 1 and which as number 2. With one arrangement we shall
reject $H_0$ unless $A_{12} < F < B_{12}$ and with the other arrangement we shall
reject it unless $A_{21} < 1/F < B_{21}$. But it is easily seen that if $A_{12} = 1/B_{21}$
(and hence $A_{21} = 1/B_{12}$) the conditions of rejection are identical. We can,
therefore, agree to use only values of F > 1, and so are interested only in the
upper tail of the F-distribution.

The values of $B_{12}$ are given by Snedecor's tables for $\tfrac{1}{2}\alpha = 0.05$ and 0.01,
and hence correspond to $\alpha = 0.1$ and 0.02. That is, the 5% point corresponds
to a 10% level of significance.

If, on the other hand, we want to reject $H_0$ if $\theta > 1$ but do not mind
accepting it if $\theta < 1$ (a situation which is common in the analysis of variance,
Chapter IX), we shall want to use a one-tailed test. We choose $B_{12}$ so that
$\int_{B_{12}}^{\infty} f(F)\,dF = \alpha$, and reject $H_0$ if $F > B_{12}$. The values of $B_{12}$ are still given
by Snedecor's tables, but they now correspond to significance levels of $\alpha = 0.05$
and 0.01. The confidence interval for $\theta$ is from $F/B_{12}$ to $\infty$.

It may be proved that if

$F = \frac{n_2(1 - x)}{n_1 x}$

then x is a Beta-variate with distribution function equal to the incomplete
Beta-function $I_x(\tfrac{1}{2}n_2, \tfrac{1}{2}n_1)$. Tables have been computed by Miss C. M.
Thompson (Biometrika, 32, 1941, pp. 151-181) giving values of x
corresponding to $I_x$ = 0.005, 0.01, 0.025, 0.05, 0.10, 0.25, and 0.50. From these values
the extensive F tables cited in the footnote to § 7.15 were calculated.

7.17 Confidence Limits. (a) For the mean. Let $\bar{x}$ and $s$ be the mean and
standard deviation of a sample of N = n + 1 items drawn from a normal
universe with unknown mean $\mu$. The problem is to determine an interval
surrounding $\bar{x}$ in which we may assume, with a certain degree of confidence, that
$\mu$ is contained. We have seen that the variable

$t = \frac{(\bar{x} - \mu)\,n^{1/2}}{s}$

is distributed in accord with the $f_n(t)$ curve and that $P = 1 - P_n(t)$ has
been tabulated for various values of t and n, where

$P_n(t) = \int_{-t}^{t} f_n(t)\,dt$

Therefore, for an assigned $\epsilon$ and for an assigned value of n (n < 30), we may
obtain from the tables upper and lower critical values of t by solving the
equation $P = 2\epsilon$. With these critical values we can determine the required
interval surrounding $\bar{x}$ for the given value of $\epsilon$. It is conventional to take $\epsilon = 0.005$
(or 0.025) since we wish to determine confidence limits dividing hypotheses
that will be rejected from those acceptable at the 1% (or 5%) level of
significance.

Suppose, then, that we make the claim

(7.73)  $\bar{x} - \frac{t_\epsilon s}{\sqrt{n}} < \mu < \bar{x} + \frac{t_\epsilon s}{\sqrt{n}}$

and we desire the probability of an error in this statement to be not more than
$2\epsilon = 0.01$. Taking n = 15, for example, we find from Table IV that
$t = \pm 2.947$ when P = 0.01. Then we have

$(\bar{x} - \mu) = \pm\frac{2.947\,s}{\sqrt{15}} = \pm 0.76s$

and the claim

$\bar{x} - 0.76s < \mu < \bar{x} + 0.76s$

will be correct 99% of the time.
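The critical value read from Table IV can also be found by solving $P_n(t) = 1 - P$ numerically. The following is a minimal sketch under stated assumptions: Simpson's rule replaces the tabulated integral, the root is found by bisection on an assumed bracket of 0 to 20, and the function names are illustrative.

```python
import math


def t_density(t: float, n: int) -> float:
    """f_n(t) = K_n (1 + t^2/n)^(-(n+1)/2)."""
    log_k = math.lgamma((n + 1) / 2) - math.lgamma(n / 2) - 0.5 * math.log(n * math.pi)
    return math.exp(log_k - 0.5 * (n + 1) * math.log1p(t * t / n))


def central_area(t: float, n: int, steps: int = 2000) -> float:
    """P_n(t): area under f_n between -t and +t by Simpson's rule (steps even)."""
    h = t / steps
    total = t_density(0.0, n) + t_density(t, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    return 2.0 * total * h / 3.0


def critical_t(n: int, P: float, lo: float = 0.0, hi: float = 20.0) -> float:
    """Solve P_n(t) = 1 - P for t by bisection; P_n(t) increases with t."""
    target = 1.0 - P
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if central_area(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


n = 15
t_eps = critical_t(n, 0.01)          # two-tailed P = 2 * epsilon = 0.01
half_width = t_eps / math.sqrt(n)    # multiplier of s in the claim (7.73)
print(f"t = {t_eps:.3f}, limits are mean +/- {half_width:.2f} s")
```

Changing the second argument to 0.05 gives the corresponding 95% limits.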

It is clear from the above procedure that our confidence in the limits
$\bar{x} \pm t_\epsilon s/\sqrt{n}$ is measured by the area under the $f_n(t)$ curve inside $t = \pm t_\epsilon$,
that is, by $P_n(t_\epsilon)$. This means that, if we could observe all possible samples,
the proportion represented by $P_n(t_\epsilon)$ would yield values of $\bar{x}$ and $s$ for which
the claim (7.73) is true, while the remaining proportion, $P = 1 - P_n(t_\epsilon)$,
would yield values of $\bar{x}$ and $s$ for which the claim is false.



If we were testing a hypothetical value of $\mu$ we would say that $\bar{x}$ is not
significantly different at the 1% level of significance if $\mu$ has any value in the
$\bar{x} \pm t_\epsilon s/\sqrt{n}$ interval, $\epsilon$ = 0.005. If $\mu$ does not lie in this interval we say
that $\bar{x}$ is significantly different at this level.

Obviously, values of t satisfying the equation P = 0.01, that is, $P_n(t) = 0.99$,
vary with n. To avoid the trouble of entering a table we give an alternate
method which is valid when the sample is not small. Recall that the variable

$t = \dfrac{(\bar{x} - \mu)(N - 3)^{1/2}}{s}$

is approximately normally distributed when N > 30. The area under the
normal curve outside t = ±2.576 is 0.01. Therefore, the 99% confidence
interval of $\mu$ is then

$\bar{x} - \dfrac{2.576\, s}{(N - 3)^{1/2}} < \mu < \bar{x} + \dfrac{2.576\, s}{(N - 3)^{1/2}}$

and the interval gets smaller as N increases.
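
The large-sample procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `large_sample_limits` and the sample figures (mean 50, s = 5, N = 103) are hypothetical and do not come from the text, and s is taken to be the sample standard deviation with divisor N, as elsewhere in this chapter.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_limits(xbar, s, N, confidence=0.99):
    """Approximate confidence limits for the mean when N > 30.

    Relies on the approximation that (xbar - mu) * sqrt(N - 3) / s is
    roughly a standard normal variable.
    """
    # Two-sided normal critical value; about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * s / sqrt(N - 3)
    return xbar - half_width, xbar + half_width


# Hypothetical sample: mean 50, s = 5, N = 103, so sqrt(N - 3) = 10.
lower, upper = large_sample_limits(xbar=50.0, s=5.0, N=103)
print(round(lower, 2), round(upper, 2))
```

Because the half-width is proportional to $(N - 3)^{-1/2}$, quadrupling N - 3 halves the interval.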

(b) For the difference between two means. Let $\bar{x}_1$ and $s_1^2$ be the observed
mean and variance of a sample of $N_1$ drawn from a normal universe with
unknown mean $\mu_{(1)}$ and let $\bar{x}_2$ and $s_2^2$ be the observed mean and variance of a
sample of $N_2$ drawn from a normal universe with unknown mean $\mu_{(2)}$. It is
assumed that the two universes have a common variance $\sigma^2$. For brevity, let

$w = \bar{x}_1 - \bar{x}_2, \qquad \omega = \mu_{(1)} - \mu_{(2)}, \qquad N = N_1 + N_2,$

$\hat{\sigma}_w = \left[ \dfrac{N_1 s_1^2 + N_2 s_2^2}{N - 2} \cdot \dfrac{N}{N_1 N_2} \right]^{1/2}$

Then

(7.74)    $t = \dfrac{w - \omega}{\hat{\sigma}_w}$

is distributed in accord with $f_n(t)$ for n = N - 2. From (7.74), upper and
lower confidence values of $\omega$ can be found by assigning to t the solutions of
$P_n(t) = 0.99$, that is, of P = 0.01. If the value $\omega = 0$ falls outside the
confidence interval thus established, the conclusion is that the difference
between the means is significant at the 1% level. That is, $\omega \neq 0$ and hence
$\mu_{(1)} \neq \mu_{(2)}$.

If the two samples are equal in number and if the variates are paired in some
manner we may compute (7.74) by a different method. Let $N = N_1 = N_2$,
$w_i = x_{1i} - x_{2i}$, and compute $\bar{w}$ and $\sum_{i=1}^{N} (w_i - \bar{w})^2$. Then

(7.75)    $t = \dfrac{\bar{w} - \omega}{\left[ \dfrac{\sum (w_i - \bar{w})^2}{N(N - 1)} \right]^{1/2}}$

The last expression is sometimes called Bessel's Formula.

Example 4 (Snedecor). Imagine a newly discovered apple, attractive in appearance,
delicious in flavor, having apparently all the qualifications of success. It has been christened
"King." Only its yielding capacities in various localities are yet to be tested. The following
procedure is decided upon. King is planted adjacent to Standard in 15 orchards scattered
about the region suitable for production. Years later, when the trees have matured, the
yields are measured and recorded in the following table, where $x_1$ refers to King, $x_2$ to
Standard, and $w = x_1 - x_2$. The yields are in bushels.


            x_1      x_2       w      (w - w̄)^2
             13       11       2         16
             12        6       6          0
             10        3       7          1
              6        1       5          1
             13        7       6          0
             15       10       5          1
             19        9      10         16
             10        4       6          0
             11        3       8          4
             11        6       5          1
             13        8       5          1
              9        5       4          4
             14        7       7          1
             12        6       6          0
             12        4       8          4
Totals                        90         50


Substituting in (7.75) we get

$t = \dfrac{6 - \omega}{\left[ \dfrac{50}{(15)(14)} \right]^{1/2}} = \dfrac{6 - \omega}{0.488}$

Entering Table IV for n = 14 we find that P = 0.01 when t = 2.977. Then solving the
equation

$\dfrac{6 - \omega}{0.488} = \pm 2.977$

we obtain $\omega = 4.55$ and $\omega = 7.45$. Since $\omega = 0$ is outside the interval from 4.55 to 7.45,
the observed value of $\bar{w}$ differs significantly from zero. In other words, we would reject (at
the 1% level of significance) the null hypothesis that there is no difference between
the yields of the two varieties.
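
The arithmetic of Example 4 can be retraced with a short Python sketch. The yields are those of the table in the example and the critical value 2.977 is the Table IV figure quoted there. The function name `paired_limits` is illustrative only.

```python
from math import sqrt

# Yields in bushels from Example 4 (King and Standard, 15 orchards).
king = [13, 12, 10, 6, 13, 15, 19, 10, 11, 11, 13, 9, 14, 12, 12]
standard = [11, 6, 3, 1, 7, 10, 9, 4, 3, 6, 8, 5, 7, 6, 4]


def paired_limits(x1, x2, t_crit):
    """Confidence limits for the mean paired difference, following (7.75)."""
    w = [a - b for a, b in zip(x1, x2)]
    n_pairs = len(w)
    w_bar = sum(w) / n_pairs
    sum_sq = sum((wi - w_bar) ** 2 for wi in w)
    # Denominator of (7.75): [sum (w - w_bar)^2 / (N (N - 1))]^(1/2).
    std_err = sqrt(sum_sq / (n_pairs * (n_pairs - 1)))
    return w_bar, std_err, (w_bar - t_crit * std_err, w_bar + t_crit * std_err)


w_bar, std_err, (lower, upper) = paired_limits(king, standard, t_crit=2.977)
print(round(w_bar, 2), round(std_err, 3), round(lower, 2), round(upper, 2))
# prints: 6.0 0.488 4.55 7.45
```

Zero lies outside the computed interval, which is the same conclusion the example reaches.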


It is important to note that the circumstances in which (7.74) and (7.75)
apply are quite different. In using (7.74) we are assuming that we have two
completely independent samples from two normal populations with the same
variance. We make the null hypothesis that the means are also the same,
and construct the confidence interval on this assumption. If the observed
difference of means is too great we reject the null hypothesis. The number
of degrees of freedom is $N_1 + N_2 - 2$.

In using (7.75), however, the two samples are not assumed to be independent.
They consist of N pairs, the members of each pair differing perhaps
in their values of x, but otherwise as identical as possible. In Example 4,
the two varieties of apple are planted adjacent to each other, so as to minimize
any differential effects of moisture, soil fertility, etc., on the yields. The
degrees of freedom are here only N - 1.

(c) For the variance. As noted in § 7.6, $Ns^2/\sigma^2$ has the $\chi^2$ distribution with
n = N - 1 degrees of freedom.

To determine the confidence limits of $\sigma^2$ we first observe that

$Ns^2 = n\hat{\sigma}^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2$

and therefore we may write $\chi^2 = n\hat{\sigma}^2/\sigma^2$. If now we make the claim

$\dfrac{n\hat{\sigma}^2}{\chi_2^2} < \sigma^2 < \dfrac{n\hat{\sigma}^2}{\chi_1^2}$

where $\chi_1^2$ and $\chi_2^2$ are arbitrarily chosen constants ($\chi_1^2 < \chi_2^2$), then our
confidence in this claim is measured by $J_n(\chi_1^2) - J_n(\chi_2^2)$, where

$J_n(\chi^2) = \int_{\chi^2}^{\infty} f_n(\chi^2)\, d\chi^2$

Values of $J_n(\chi^2)$ can be obtained from Pearson's Tables, or from Appendix,
Table III.

More extensive tables have been calculated by C. M. Thompson, Biometrika,
32, 1941-2, pp. 187-191. These give $\chi^2$ for $P = J_n(\chi^2)$ = 0.005, 0.01, 0.025,
0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.975, 0.99, 0.995 and for n = 1(1)30,*
40, 50, 60, 70, 80, 90, 100.

* That is, for all n between 1 and 30 at intervals of 1.

(d) For the standard deviation. The frequency function of s is given in (7.39).

An unbiased estimate of $\sigma$ from a sample value s is s/b(N), where 1/b(N) is
given by Table 10. Values of $s/\sigma$ at selected probability points are given in a
table compiled by Croxton and Cowden and reprinted as Appendix in
their textbook Practical Business Statistics, from which several sets of
confidence limits may be calculated. Alternatively we may find limits for $s^2$
and take the square root.

7.18 Standard Errors of the k-Statistics and of $g_1$ and $g_2$. The k-statistics
were defined in § 5.8 as follows:

$k_1 = m_1' = \bar{x}$
$k_2 = N m_2/(N - 1)$
$k_3 = N^2 m_3/[(N - 1)(N - 2)]$
$k_4 = N^2[(N + 1)m_4 - 3(N - 1)m_2^2]/[(N - 1)(N - 2)(N - 3)]$

We have already proved that, for any parent population for which the moments
up to the fourth exist (see (6.14), (7.8), (7.19)),

$E(m_1') = \nu_1$
$E(m_2) = (N - 1)\mu_2/N$
$E(m_2^2) = (N - 1)[(N - 1)\mu_4 + (N^2 - 2N + 3)\mu_2^2]/N^3$

By similar algebraic calculations it may be shown that

$E(m_3) = (N - 1)(N - 2)\mu_3/N^2$
$E(m_4) = (N - 1)[(N^2 - 3N + 3)\mu_4 + 3(2N - 3)\mu_2^2]/N^3$

Hence, bearing in mind the definition of the cumulants $\kappa_r$ given in (4.65),
we see that

(7.76)    $E(k_1) = \kappa_1, \quad E(k_2) = \kappa_2, \quad E(k_3) = \kappa_3, \quad E(k_4) = \kappa_4$

From (5.64) and (7.20) it follows that

(7.77)    $\mathrm{Var}(k_2) = N^2(N - 1)^{-2}\, \mathrm{Var}(m_2) = N^{-1}\kappa_4 + 2(N - 1)^{-1}\kappa_2^2$

From (7.19) and (7.76), we can show that

$E[2k_2^2(N + 1)^{-1} + k_4(N - 1)N^{-1}(N + 1)^{-1}] = \kappa_4 N^{-1} + 2\kappa_2^2(N - 1)^{-1}$

Therefore an unbiased estimate of $\mathrm{Var}(k_2)$ is

(7.78)    $[2k_2^2 + (N - 1)k_4/N]/(N + 1),$

and the square root of this is the standard error of $k_2$.

If the parent population is known to be normal,

(7.79)    $\mathrm{Var}(k_2) = 2\kappa_2^2(N - 1)^{-1}$

and

(7.80)    $E(s^4) = (N^2 - 1)N^{-2}\sigma^4$

Hence an estimate of the variance of $k_2$ is

(7.81)    $2s^4N^2(N - 1)^{-1}(N^2 - 1)^{-1} = 2(N + 1)^{-1}k_2^2$

and the square root of this is the standard error of $k_2$. The result (7.81) is
obtained by putting $k_4 = 0$ in (7.78).

It may similarly be proved that for any parent population

(7.82)    $\mathrm{Var}(k_3) = \kappa_6 N^{-1} + 9\kappa_2\kappa_4(N - 1)^{-1} + 9\kappa_3^2(N - 1)^{-1} + 6\kappa_2^3 N(N - 1)^{-1}(N - 2)^{-1}$

For a normal parent population this reduces to

(7.83)    $\mathrm{Var}(k_3) = 6\kappa_2^3 N(N - 1)^{-1}(N - 2)^{-1}$

Since, from (7.41) with r = 6,

$E(s^6) = N^{-3}(N - 1)(N + 1)(N + 3)\sigma^6$

an unbiased estimate of $\mathrm{Var}(k_3)$ is

$6s^6N^4(N - 2)^{-1}(N - 1)^{-2}(N + 1)^{-1}(N + 3)^{-1} = 6k_2^3N(N - 1)(N - 2)^{-1}(N + 1)^{-1}(N + 3)^{-1}$

and the square root of this is the standard error of $k_3$.
A similar argument shows that an unbiased estimate of the variance of $k_4$,
for a sample from a normal parent population, is

(7.84)    $\mathrm{Var}(k_4) = 24k_2^4N(N - 1)^2(N - 3)^{-1}(N - 2)^{-1}(N + 3)^{-1}(N + 5)^{-1}$

The variances of $g_1 = k_3/k_2^{3/2}$ and of $g_2 = k_4/k_2^2$, for samples from a normal
parent population, were worked out by Fisher, who found that

(7.85)    $\mathrm{Var}(g_1) = 6N(N - 1)(N - 2)^{-1}(N + 1)^{-1}(N + 3)^{-1}$

and

(7.86)    $\mathrm{Var}(g_2) = 24N(N - 1)^2(N - 3)^{-1}(N - 2)^{-1}(N + 3)^{-1}(N + 5)^{-1}$

For large N these approximate to the values 6/N and 24/N given in Chapter
VI.
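
As a rough illustration of this section, the sketch below computes $k_2$, $k_3$, $k_4$ from the sample central moments, using the standard definitions of the k-statistics, and evaluates the normal-theory variances (7.85) and (7.86). The function names and the five-value sample are hypothetical and serve only as a check on the algebra. At least four observations are assumed so that no denominator vanishes.

```python
def k_statistics(data):
    """Return (k2, k3, k4) computed from the central moments m2, m3, m4."""
    n = len(data)
    mean = sum(data) / n
    m2, m3, m4 = (sum((x - mean) ** r for x in data) / n for r in (2, 3, 4))
    k2 = n * m2 / (n - 1)
    k3 = n ** 2 * m3 / ((n - 1) * (n - 2))
    k4 = (n ** 2 * ((n + 1) * m4 - 3 * (n - 1) * m2 ** 2)
          / ((n - 1) * (n - 2) * (n - 3)))
    return k2, k3, k4


def var_g1_normal(n):
    """Variance of g1 for a normal parent population, formula (7.85)."""
    return 6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3))


def var_g2_normal(n):
    """Variance of g2 for a normal parent population, formula (7.86)."""
    return 24 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5))


# Hypothetical sample, used only to exercise the formulas.
k2, k3, k4 = k_statistics([1, 2, 3, 4, 5])
g1 = k3 / k2 ** 1.5
g2 = k4 / k2 ** 2
print(k2, k3, k4, g1, g2)
print(var_g1_normal(1000) * 1000, var_g2_normal(1000) * 1000)
```

The last line shows how closely N times each variance approaches 6 and 24 when N = 1000.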

7.19 The Distribution of Extreme Values and of the Range. Let $x_1, x_2, \ldots, x_N$
be a set of sample values, arranged in ascending order, from a parent
population with distribution function F(x). The probability that all N
individuals in the sample lie between $-\infty$ and x is $[F(x)]^N$, since by definition
F(x) is equal to the corresponding probability for any member of the
population. But if all the $x_i$ lie between $-\infty$ and x, so does $x_N$. Hence the
distribution function for $x_N$ is

$G(x) = \Pr\{x_N < x\} = [F(x)]^N$

If F(x) is known, G(x) can be found. For a normal parent population
$F(x) = \Phi(x)$, and L. H. C. Tippett has calculated a table for selected values
of N between 3 and 1000. This is reproduced in Tables for Statisticians and
Biometricians, Part II (Table XXI). Since the normal distribution is
symmetrical, the table serves equally well for the least value, $x_1$. Thus

$\Pr\{x_N < x\} = \Pr\{x_1 > -x\}$

In this table the unit is the standard deviation of the parent population and
the $x_i$ are measured from the population mean.

The table may be used to determine whether an outlying sample value
should be rejected as being too large (or too small) to be reasonably regarded
as a random fluctuation. For this purpose, Tippett and Egon Pearson have
calculated a table (loc. cit., Table XXI bis) from which Table 11 is a brief
extract, giving for various sample sizes the deviations which are exceeded by
the extreme variate in the stated percentage of cases.


                Table 11

      N           5%          1%
      1         1.645       2.326
      5         2.319       2.877
     10         2.568       3.089
     15         2.705       3.207
     20         2.799       3.289
     30         2.928       3.402
     50         3.082       3.539
    100         3.283       3.718
   1000         3.884       4.264


Example 5. Suppose that in the mass production of a certain article the manufacturer
aims at an average breaking strength of 180 lb with a standard deviation not exceeding 12 lb.
In a sample of 10 items the lowest breaking strength would be below 180 - 12 × 3.089 = 142.9
lb only once in 100 times, on the hypothesis that the sample is a random one from a normal
population with mean 180 lb and standard deviation 12 lb. Hence the occurrence of such
a sample might well warrant investigation of the process.
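
The entries of Table 11 follow from the relation $[\Phi(x)]^N = 1 - P$, where P is the stated percentage expressed as a probability. A minimal sketch, using only the Python standard library and a hypothetical function name `extreme_deviation`, inverts that relation and retraces the calculation of Example 5:

```python
from statistics import NormalDist


def extreme_deviation(N, tail):
    """Deviation, in units of the population standard deviation, exceeded by
    the largest of N normal values with probability `tail`.

    Solves [Phi(x)]^N = 1 - tail for x.
    """
    return NormalDist().inv_cdf((1.0 - tail) ** (1.0 / N))


# Compare with the Table 11 entries for N = 10.
print(round(extreme_deviation(10, 0.05), 3), round(extreme_deviation(10, 0.01), 3))

# Example 5: lower 1% limit for the weakest of 10 items (mean 180 lb, s.d. 12 lb).
print(round(180 - 12 * extreme_deviation(10, 0.01), 1))
```

By the symmetry noted above, the same deviation taken with a negative sign applies to the smallest value in the sample.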

The chance that in a sample of N there will be one individual at $x_1$, one at
$x_N$, and N - 2 in between, is given by

$N(N - 1)(F_N - F_1)^{N-2}\, dF_1\, dF_N$

where $F_1$ stands for $F(x_1)$ and $F_N$ for $F(x_N)$, and where $dF_i = f(x_i)\, dx_i$. This
is because there are N(N - 1) ways of fixing the order of appearance of the
greatest and least individuals within a random sample of N. The greatest
member might be the first, second, or any one up to the Nth, and the smallest
could appear in any of the remaining N - 1 positions. For this sample, the
range is given by

$w = x_N - x_1$

Putting $x_N = x_1 + w$, and noting that for a fixed value of $x_1$, $dx_N = dw$,
we have as the joint frequency function of $x_1$ and w

$N(N - 1)[F(x_1 + w) - F(x_1)]^{N-2} f(x_1 + w) f(x_1)$

Integrating over all values of $x_1$ we obtain for the frequency function of the
range ($0 < w < \infty$)

(7.87)    $g(w) = N(N - 1) \int_{-\infty}^{\infty} [F(x_1 + w) - F(x_1)]^{N-2} f(x_1 + w) f(x_1)\, dx_1$

The distribution function for the range is

(7.88)    $G(w) = \int_0^w g(w)\, dw$

Since the order of integration may be reversed, and since

$f(x_1 + w)\, dw = d[F(x_1 + w) - F(x_1)]$

we find

$G(w) = N(N - 1) \int_{-\infty}^{\infty} f(x_1)\, dx_1 \int_0^w [F(x_1 + w) - F(x_1)]^{N-2}\, d[F(x_1 + w) - F(x_1)]$

(7.89)    $G(w) = N \int_{-\infty}^{\infty} [F(x_1 + w) - F(x_1)]^{N-1} f(x_1)\, dx_1$

This formula, although apparently simple, is not in general of much use,
because of the difficulty of expressing F(x + w) in terms of F(x). Tippett
has calculated the expectation of w as

(7.90)    $E(w) = \int_{-\infty}^{\infty} [1 - F^N - (1 - F)^N]\, dx$

and has evaluated this expression by numerical integration for values of N
from 2 to 1000 when the parent population is normal [that is, when $F = \Phi(x)$].
His results are reproduced as Table XXII of Pearson's Tables for Statisticians
and Biometricians, Part II.

If the parent population is rectangular, so that f(x) = 1, 0 < x < 1, we
have F(x) = x (0 < x < 1) and F(x) = 1 (x > 1). Therefore

$G(w) = N \int_0^{1-w} w^{N-1}\, dx + N \int_{1-w}^{1} (1 - x)^{N-1}\, dx$

(7.91)    $G(w) = N w^{N-1}(1 - w) + w^N$

If the parent population is normal, $f(x) = \phi(x)$, $F(x) = \Phi(x)$, and

$G(w) = N \int_{-\infty}^{\infty} \phi(x)[\Phi(x + w) - \Phi(x)]^{N-1}\, dx$

Now, putting x = -y - w, we have

$\int_{-\infty}^{-w/2} \phi(x)[\Phi(x + w) - \Phi(x)]^{N-1}\, dx = \int_{-w/2}^{\infty} \phi(y + w)[\Phi(-y) - \Phi(-y - w)]^{N-1}\, dy$

because $\phi(x) = \phi(-x)$. Also,

$\Phi(-y) - \Phi(-y - w) = \Phi(y + w) - \Phi(y)$

so that on replacing the variable of integration y by x we obtain

$\int_{-\infty}^{-w/2} \phi(x)[\Phi(x + w) - \Phi(x)]^{N-1}\, dx = \int_{-w/2}^{\infty} \phi(x + w)[\Phi(x + w) - \Phi(x)]^{N-1}\, dx$

Consequently

(7.92)    $G(w) = N \int_{-w/2}^{\infty} [\phi(x) + \phi(x + w)][\Phi(x + w) - \Phi(x)]^{N-1}\, dx$

$= N \int_{-w/2}^{\infty} [\phi(x) - \phi(x + w)][\Phi(x + w) - \Phi(x)]^{N-1}\, dx + 2N \int_{-w/2}^{\infty} \phi(x + w)[\Phi(x + w) - \Phi(x)]^{N-1}\, dx$

The first term, whose integrand is an exact differential, is equal to

$[\Phi(w/2) - \Phi(-w/2)]^N = [2\Phi(w/2) - 1]^N$

The second is equal to

$2N \int_{w/2}^{\infty} \phi(u)[\Phi(u) - \Phi(u - w)]^{N-1}\, du$

where u = x + w. Values of G(w) may be calculated by numerical methods.
H. O. Hartley has given an accurate 4-place table of G(w) for all values of N
from 2 up to 20. For sample sizes larger than 20 the range is of little value in
practice and the distribution is so sensitive to slight variations from normality
in the parent population that the tabulated values cease to be trustworthy.

The principal use of the range as a measure of variability is in the
construction of charts for quality control in the process of manufacturing certain
articles, and sample sizes of 5 or so are quite common in such use. (See
Ref. 6.)
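
The remark that G(w) may be calculated by numerical methods can be illustrated with a short sketch. It evaluates the range distribution of N standard normal values as $[2\Phi(w/2) - 1]^N + 2N\int_{w/2}^{\infty}\phi(u)[\Phi(u) - \Phi(u - w)]^{N-1}\,du$, which is the form to which (7.92) reduces, using Simpson's rule. The function name, the step count, and the truncation of the upper limit at w/2 + 10 are assumptions made for this illustration and are not Hartley's method of tabulation.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def range_cdf_normal(w, n, steps=2000, span=10.0):
    """G(w) for the range of n standard normal values (n >= 2)."""
    if w <= 0:
        return 0.0
    a = w / 2.0
    b = a + span              # the integrand is negligible beyond this point
    h = (b - a) / steps       # steps must be even for Simpson's rule

    def integrand(u):
        return _STD.pdf(u) * (_STD.cdf(u) - _STD.cdf(u - w)) ** (n - 1)

    total = integrand(a) + integrand(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(a + i * h)
    integral = total * h / 3.0
    return (2.0 * _STD.cdf(a) - 1.0) ** n + 2.0 * n * integral


# For n = 2 the range is |x1 - x2|, so G(w) should equal 2*Phi(w/sqrt(2)) - 1.
print(range_cdf_normal(1.0, 2), 2.0 * _STD.cdf(1.0 / sqrt(2.0)) - 1.0)
print(range_cdf_normal(3.0, 5))
```

The case n = 2 gives an independent check, because the difference of two independent standard normal values is normal with variance 2.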

7.20 Confidence Limits for the Binomial and Poisson Distributions. The
usual method of obtaining confidence limits for the proportion of individuals
in a population having a certain characteristic is to assume that the sample is
large enough to justify the normal approximation to the binomial. This
method was described in § 6.15. It assumes also that the sample proportion
p is an adequate estimate of the true proportion $\theta$, and this may not be true
for sample sizes less than, say, 100.

A more accurate procedure is to use Table VIII 1 in Statistical Tables by
Fisher and Yates. If an event is observed to occur a times in N trials, the
observed proportion is p = a/N. If $\theta_l$ is the lower 100(1 - $\alpha$)% confidence
limit for the true proportion $\theta$, then if $\theta$ were really equal to $\theta_l$ an observed
number of successes at least as great as a would occur by chance with
probability P = $\alpha$/2. Similarly if $\theta_u$ is the upper 100(1 - $\alpha$)% confidence limit,
then if $\theta$ were really equal to $\theta_u$ an observed number of successes at least as
small as a would occur by chance with the same probability P. The
corresponding limits of expectation for a are $a_l = \theta_l N$ and $a_u = \theta_u N$. These are
tabulated for three values of P (0.005, 0.025, and 0.1), corresponding to
$\alpha$ = 0.01, 0.05, 0.2 respectively, and for values of a from 0 to 10, and of p
from 0 to 0.5. The values for p = 0 give the limits of expectation for the
Poisson distribution.

Thus suppose a = 2 and N = 5, so that p = 0.4. The standard error of
p by the usual formula is $[(0.4 \times 0.6)/5]^{1/2} = 0.2191$, and the 95% limits
obtained by (6.35), with $t_\alpha = 1.96$, are 0.922 and -0.035.

The limits obtained by the still simpler formula 0.4 ± 1.96 × 0.2191 are
0.829 and -0.029. The correct limits as given by Table VIII 1 are 0.853
and 0.053. (The table gives values of $a_l$ and $a_u$, which must be divided by N
to give $\theta_l$ and $\theta_u$.)

A more extensive set of tables for the same purpose has been prepared by
D. Mainland. For all values of N from 2a up to 30 and by increasing steps
up to 1000, three sets of confidence limits are given. If a > N/2, we use
a' = N - a, instead of a. Less complete tables are given for a > 20. With
the help of these tables it is an easy matter to determine confidence limits
corresponding to any sample proportion likely to occur in practice.
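
The definition of the exact limits given in this section can be turned directly into a small computation. The sketch below solves the two tail-probability equations by bisection. It is an illustration under the stated definition and not a reproduction of how the Fisher and Yates or Mainland tables were constructed. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, theta):
    """Probability of at most k successes in n trials with success chance theta."""
    return sum(comb(n, i) * theta ** i * (1.0 - theta) ** (n - i)
               for i in range(k + 1))


def exact_limits(a, n, alpha=0.05):
    """Lower and upper 100(1 - alpha)% limits for the true proportion."""

    def solve(g):
        # g is increasing in theta, negative at 0 and positive at 1.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if g(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower limit: Pr{successes >= a} = alpha / 2.
    lower = 0.0 if a == 0 else solve(
        lambda t: (1.0 - binom_cdf(a - 1, n, t)) - alpha / 2.0)
    # Upper limit: Pr{successes <= a} = alpha / 2.
    upper = 1.0 if a == n else solve(
        lambda t: alpha / 2.0 - binom_cdf(a, n, t))
    return lower, upper


lower, upper = exact_limits(2, 5)
print(round(lower, 3), round(upper, 3))
```

For a = 2 and N = 5 the result can be compared with the limits 0.053 and 0.853 quoted above. Unlike the normal approximation, these limits can never fall outside the range 0 to 1.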


Problems 

1. In a certain observed distribution, N = 20, $\bar{x}$ = 42, s = 5. Test the hypothesis that
this distribution is a random sample from a normal universe with mean of 50.

2. In a certain test, one section of 20 students had an average score of 40 with a standard 
deviation of 5. Another section of 25 had an average of 46 with standard deviation of 4. 
Does this indicate a significant difference in the two groups? What assumptions do you 
make in applying the test? 

3. In an experiment in industrial psychology a job was performed by one group of 30 
workmen according to Method I and by a second group of 40 according to Method II. (The 
groups were independent and equally efficient.) Are the following distributions of the time 
(in seconds) taken such as to justify the conclusion that Method I is the speedier of the two? 
Use the difference between the means as a criterion of judgment. 


    Time        I        II
     50         1         0
     51         3         1
     52         5         2
     53         4         5
     54         7         8
     55         5         9
     56         3         6
     57         1         3
     58         1         3
     59         0         1
     60         0         2
   Totals      30        40

4. From the separate distribution functions of $\bar{x}$ and s derive the distribution of
"Student's" z, and from that obtain the function $f_n(t)$.

Hint. Let $z = \bar{x}/s$, $u = \bar{x}^2 + s^2$, and find the joint distribution of z and u. Then
integrate out u.

5. Prove that $f_n(t)$ is asymptotically normally distributed.

6. Write out in full the derivation of Fisher's z function, g(z) of (7.70), as outlined in
§ 7.14.


7. Verify the Romanovsky expansions of b(N) and k(N) in series, as quoted in § 7.8.
Hint. Prove that

$\Gamma\left(\dfrac{N}{2}\right) \Gamma\left(\dfrac{N - 1}{2}\right) = \dfrac{(N - 2)!\, \sqrt{\pi}}{2^{N-2}}$    (see Chapter III). Then

$\dfrac{\Gamma(N/2)}{\Gamma[(N - 1)/2]} = \dfrac{[\Gamma(N/2)]^2\, 2^{N-2}}{(N - 2)!\, \sqrt{\pi}}$

Put $\Gamma(N/2) = \left(\dfrac{N - 2}{2}\right)!$ and expand the factorials by means of
Stirling's approximation, using three terms of the series. k(N) can be verified only as far
as the second term, unless four terms are used in the Stirling series. (The fourth term is
$-139/51840N^3$.)

8. Show that by the change of variable $z = \left(1 + \dfrac{n_1 F}{n_2}\right)^{-1}$, the F-distribution of (7.71)
becomes a Pearson Type I distribution.

9. The breaking strengths of 10 specimens of 0.104-in. diameter hard-drawn copper wire 
are found to be 578, 572, 570, 568, 572, 570, 570, 572, 596, 584 lb. Calculate the 95% 
confidence limits for the breaking strength of this kind of wire. Ans. 569 and 581 lb. 

10. In the course of archaeological investigations conducted at a certain site, 16 lower first
molars were found with mean length 13.57 mm and standard deviation 0.72 mm. From a
near-by site, 9 lower first molars were taken with a mean of 13.06 mm and a standard
deviation of 0.62 mm. Is the difference in mean length compatible with the hypothesis that the
two finds are samples of the same population?

11. Calculate 98% confidence limits for the variance and for the standard deviation of
the breaking strength in samples of 10 of the pieces of copper wire mentioned in Problem 9.

12. Suppose that a number of measurements are made in duplicate. Prove that the
standard deviation of a sample of 2 is given by $s = \frac{1}{2}|X_1 - X_2|$, where $X_1$ and $X_2$ are the
duplicate measurements. Hence show that an unbiased estimate of the standard deviation
of the population of such samples is $\hat{\sigma}$ = 0.8862 times the mean value of $|X_1 - X_2|$ for the
samples measured.

Hint. Use (7.40).

13. In a series of 6 duplicate plate counts of molds on butter, the following results were
obtained:

   Sample No.     Count (1)     Count (2)
       1             1400          1600
       2             4100          2700
       3              900          1100
       4             6800          7400
       5             3800          4200
       6             7100          6000

Estimate the standard deviation of the population of duplicates, by the method of
Problem 12. Compare this with the actual standard deviation of the differences between
duplicates.

Use the t test to determine whether there is any significant difference between duplicate
counts.

14. Two chemists A and B repeat a protein analysis 20 times. If $X_i$ and $Y_i$ are the
values obtained by A and B respectively, and if $\sum X_i = 196.40$, $\sum X_i^2 = 1928.6560$,
$\sum Y_i = 205.16$, $\sum Y_i^2 = 2104.7152$, determine whether there is a significant difference
in precision between the two sets of analyses, the precision being measured by the inverse
of the variance.

15. In two series of hauls to determine the number of plankton organisms inhabiting the
waters of a lake, the following results were found:

Series I: 80, 96, 102, 77, 97, 110, 99, 88, 103, 108

Series II: 74, 122, 92, 81, 104, 92, 90

In series I the 10 hauls were made in succession at the same point. In series II the 7
hauls were made at points scattered over the lake. Do the observations suggest any greater
variability between different places than exists at the same place?

16. Twelve hogs were fed on diet A, 15 on diet B. The gains in weight for the individual
hogs (in pounds) were as shown:

A: 25, 30, 28, 34, 24, 25, 13, 32, 24, 30, 31, 35

B: 44, 34, 22, 8, 47, 31, 40, 30, 32, 35, 18, 21, 35, 29, 22

What conclusions may be drawn from this experiment?

17. An observer made the following observations on the vertical diameter of the planet
Venus (in seconds of arc): 42.70, 42.56, 43.01, 43.48, 42.76, 43.06, 43.63, 42.87, 41.60, 42.78,
42.95, 43.20, 43.18, 43.39, 43.10.

Assuming that the population of readings is normally distributed about a true value which
is estimated by the arithmetic mean, calculate 95% confidence limits for the vertical
diameter of Venus.

From Table 11 show that the probability that in a sample of 15 the smallest value would
be at least as small as 41.60 is nearly 0.05. Show that the probability of a single reading
being as low as this is about 0.007. (Use the t-distribution.)


   Weight (pounds)          Frequency
     Class Marks         Boys      Girls
        42.5               1          0
        48.5               3          1
        54.5               9          7
        60.5              33         37
        66.5              65         41
        72.5              80         59
        78.5              72         58
        84.5              41         48
        90.5              27         23
        96.5               7         26
       102.5               4         16
       108.5               2          6
       114.5               1          3
       120.5               0          2
       Totals            345        326

18. A question arose in a physical education class as to whether eleven-year-old girls
weigh, as a rule, more than eleven-year-old boys. Suppose you wished to make a thorough
analysis of the data in the table above concerning weights of boys and girls aged
eleven. Describe the tests you might apply, the reasoning and assumptions underlying
these, and the interpretation that might be placed on the results.

The following points are suggested for discussion:

(a) Is there a clear difference between the two distributions? How would you test this:
from the means, from the variances, from the samples as a whole?

(b) 32.3% of the boys and 26.4% of the girls have weights less than 69.5 pounds. Is this
difference significant?

(c) Within what limits would you say that the mean and standard deviation in the
population of eleven-year-old boys (from which you have the sample of 345) is almost certain to
lie in each case?

(d) Summarize your results.

19. Use the method for obtaining approximate standard errors (commonly known as the
delta method) of § 6.10 to find the standard errors, to terms of order 1/N, for $g_1$ and $g_2$ for
samples of N from a parent population having a Gamma (Pearson Type III) distribution
with frequency function $f(x) = e^{-x}x^{\lambda-1}/\Gamma(\lambda)$. Show that to this approximation

$\mathrm{Var}(g_1) = 6N^{-1}(1 + 6\lambda^{-1} + 5\lambda^{-2})$

$\mathrm{Var}(g_2) = 24N^{-1}(1 + 42\lambda^{-1} + 167\lambda^{-2} + 126\lambda^{-3})$

The skewness of this distribution is $2\lambda^{-1/2}$.

Hint. To terms of order $N^{-1}$, $g_1 = \sqrt{b_1}$, $g_2 = b_2 - 3$. Use (6.18) and (6.19).

CHAPTER VIII


LINEAR REGRESSION, SIMPLE CORRELATION, AND CONTINGENCY


8.1 Linear Regression. Let x and y be two variables with the joint
probability density function f(x, y) and the marginal density functions g(x) and
h(y) respectively. If y is fixed between y and y + dy, the probability that x
lies in the interval x to x + dx is $\dfrac{f(x, y)}{h(y)}\, dx$, since $\int_{-\infty}^{\infty} f(x, y)\, dx = h(y)$. The
ratio $f(x, y)/h(y)$ is called the conditional probability density of x, given y. It is the
probability density for an array of x's all having the same value of y.

In the same way, the conditional probability density of y, given x, is
$f(x, y)/g(x)$, since $\int_{-\infty}^{\infty} f(x, y)\, dy = g(x)$. It is the probability density of an array
of y's all having the same value of x.

The notion of arrays may be made more concrete by thinking of a joint 
distribution of the heights and weights of men. If x refers to weight and y 
to height, then an example of an x array of y^s is the distribution of the heights 
of all men who weigh 150 pounds, and the weights of all men who are six feet 
tall is an example of a y array of x^s. 

The expectation of an x array of y's is

(8.1)    $\eta_x = \int y f(x, y)/g(x)\, dy$

integrated over all values of y in the array defined by x. The variance is
given by

(8.2)    $\sigma_{y \cdot x}^2 = \int (y - \eta_x)^2 f(x, y)/g(x)\, dy$

The locus of $\eta_x$ as a function of x is called the true regression curve of y on x.

If the equation of the regression curve is of the form

(8.3)    $\eta_x = \alpha + \beta x$

the regression of y on x is said to be linear. Similarly, if the equation of the
true regression curve of x on y is

(8.4)    $\xi_y = \alpha' + \beta' y$

the regression of x on y is linear. If one regression is linear it does not follow
that the other is linear also.

From (8.1) and (8.3),

(8.5)    $\int_{-\infty}^{\infty} y f(x, y)\, dy = \alpha g(x) + \beta x g(x)$

and on integrating both sides with respect to x, we have by (4.6) and (4.7)

(8.6)    $\nu_{01} = \alpha + \beta \nu_{10}$

Multiplying each side of (8.5) by x and integrating, we have similarly

(8.7)    $\nu_{11} = \alpha \nu_{10} + \beta \nu_{20}$

Solving (8.6) and (8.7) together for $\alpha$ and $\beta$ we obtain

(8.8)    $\beta = \dfrac{\nu_{11} - \nu_{10}\nu_{01}}{\nu_{20} - \nu_{10}^2} = \dfrac{\mu_{11}}{\mu_{20}} = \dfrac{\rho \sigma_y}{\sigma_x}$

and

(8.9)    $\alpha = \nu_{01} - \nu_{10}\, \rho \sigma_y/\sigma_x$

The equation of linear regression can, therefore, be written

(8.10)    $\eta_x - \nu_{01} = \rho\, \dfrac{\sigma_y}{\sigma_x}\, (x - \nu_{10})$

In the same way, from (8.4),

(8.11)    $\xi_y - \nu_{10} = \rho\, \dfrac{\sigma_x}{\sigma_y}\, (y - \nu_{01})$

The quantities $\beta = \rho \sigma_y/\sigma_x$ and $\beta' = \rho \sigma_x/\sigma_y$ are called the regression coefficients.
Their product is $\rho^2$.

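Example 1. Given

$f(x, y) = \dfrac{2}{a^2}, \qquad 0 < x < y, \quad 0 < y < a$

as the joint probability function of two variables x and y. Find (i) the marginal totals g(x)
and h(y); (ii) the mean and variance of each of the marginal totals, i.e., $\nu_{10}$ and
$\sigma_x^2 = \mu_{20}$ for g(x), $\nu_{01}$ and $\sigma_y^2 = \mu_{02}$ for h(y); (iii) the equations of the
regression curves of y on x and of x on y, $\eta_x$ and $\xi_y$; (iv) the correlation coefficient $\rho$.

Solutions. The volume under the surface represented by the given function is unity.
Thus

$\int_0^a \int_0^y \dfrac{2}{a^2}\, dx\, dy = 1$

(i) The marginal totals are

$g(x) = \int_x^a \dfrac{2}{a^2}\, dy = \dfrac{2}{a^2}(a - x), \qquad h(y) = \int_0^y \dfrac{2}{a^2}\, dx = \dfrac{2y}{a^2}$

(ii) The means are

$\nu_{10} = \int_0^a \dfrac{2x}{a^2}(a - x)\, dx = \dfrac{a}{3}, \qquad \nu_{01} = \int_0^a \dfrac{2y^2}{a^2}\, dy = \dfrac{2a}{3}$

Since

$\nu_{20} = \int_0^a \dfrac{2x^2}{a^2}(a - x)\, dx = \dfrac{a^2}{6}, \qquad \nu_{02} = \int_0^a \dfrac{2y^2}{a^2}\, y\, dy = \dfrac{a^2}{2}$

the variances are

$\mu_{20} = \sigma_x^2 = \dfrac{a^2}{6} - \dfrac{a^2}{9} = \dfrac{a^2}{18}, \qquad \mu_{02} = \sigma_y^2 = \dfrac{a^2}{2} - \dfrac{4a^2}{9} = \dfrac{a^2}{18}$

(iii) The regression lines are

$\eta_x = \int_x^a \dfrac{2y/a^2}{2(a - x)/a^2}\, dy = \dfrac{a + x}{2}, \qquad \xi_y = \int_0^y \dfrac{2x/a^2}{2y/a^2}\, dx = \dfrac{y}{2}$

(iv) From the equations of the regression lines it follows that $\rho^2 = \frac{1}{4}$ and $\rho = \frac{1}{2}$ since
$\rho(\sigma_y/\sigma_x)$ is positive.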
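
The results of Example 1 can be checked numerically. The sketch below assumes the uniform density $2/a^2$ on the triangle 0 < x < y < a, which is what the marginal totals and moments of the example imply, and approximates the moments with a midpoint rule on a square grid. The function name `triangle_correlation`, the grid size, and the choice a = 1 are arbitrary illustrative choices.

```python
from math import sqrt


def triangle_correlation(a=1.0, m=200):
    """Approximate nu10, nu01, the two variances and rho for f = 2/a^2
    on the triangle 0 < x < y < a, using midpoints of an m-by-m grid."""
    h = a / m
    density = 2.0 / a ** 2
    sx = sy = sxx = syy = sxy = 0.0
    for i in range(m):
        x = (i + 0.5) * h
        for j in range(i, m):
            y = (j + 0.5) * h
            # Cells on the diagonal lie only half inside the triangle.
            weight = density * h * h * (0.5 if i == j else 1.0)
            sx += weight * x
            sy += weight * y
            sxx += weight * x * x
            syy += weight * y * y
            sxy += weight * x * y
    var_x = sxx - sx ** 2
    var_y = syy - sy ** 2
    rho = (sxy - sx * sy) / sqrt(var_x * var_y)
    return sx, sy, var_x, var_y, rho


nu10, nu01, var_x, var_y, rho = triangle_correlation()
print(round(nu10, 3), round(nu01, 3), round(var_x, 4), round(var_y, 4), round(rho, 3))
```

With a = 1 the values should lie close to 1/3, 2/3, 1/18, 1/18 and 1/2, in agreement with parts (ii) and (iv).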

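8.2 The Standard Error of Estimate. The expectation, over all x arrays,
of the variance $\sigma_{y \cdot x}^2$, weighted with the marginal distribution of x, is usually
called the "variance of estimate" and will be denoted here by $\sigma_{ey}^2$. It is not
a squared standard error in the ordinary sense,* being a population parameter
instead of an estimate based on a sample. By definition,

$\sigma_{ey}^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (y - \eta_x)^2 f(x, y)\, dy\, dx$

By (8.10), if the regression is linear,

$(y - \eta_x)^2 = (y - \nu_{01})^2 - 2\rho\, \dfrac{\sigma_y}{\sigma_x}\, (y - \nu_{01})(x - \nu_{10}) + \rho^2\, \dfrac{\sigma_y^2}{\sigma_x^2}\, (x - \nu_{10})^2$

so that on integrating term by term we get

(8.12)    $\sigma_{ey}^2 = \sigma_y^2 - 2\rho\, \dfrac{\sigma_y}{\sigma_x}\, \mu_{11} + \rho^2\, \dfrac{\sigma_y^2}{\sigma_x^2}\, \sigma_x^2 = \sigma_y^2(1 - \rho^2)$

It is evident from this result that

$-1 \le \rho \le 1$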

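* See § 8.7 and § 8.0 for a definition of the real standard error of estimate.

8.3 The Normal Correlation Surface. We shall now consider a joint
probability function of special interest. The normal correlation surface is
defined by the following function

(8.13)    $f(x, y) = K \exp\left\{ -\dfrac{1}{2(1 - \rho^2)} \left[ \dfrac{x^2}{\sigma_x^2} - \dfrac{2\rho x y}{\sigma_x \sigma_y} + \dfrac{y^2}{\sigma_y^2} \right] \right\}$

where

$\dfrac{1}{K} = 2\pi \sigma_x \sigma_y (1 - \rho^2)^{1/2}$

and the variables x and y have the origin of their reference system at their
respective means, that is,

(8.14)    $\nu_{10} = \int_{-\infty}^{\infty} x g(x)\, dx = 0, \qquad \nu_{01} = \int_{-\infty}^{\infty} y h(y)\, dy = 0$

The conditions (8.14) may be imposed without essential loss of generality
and will simplify the algebraic discussion.

The marginal distribution of x is given by

$g(x) = \int_{-\infty}^{\infty} f(x, y)\, dy = K e^{-x^2/2\sigma_x^2} \int_{-\infty}^{\infty} \exp\left\{ -\dfrac{(y - \rho \sigma_y x/\sigma_x)^2}{2\sigma_y^2(1 - \rho^2)} \right\} dy = \dfrac{1}{\sigma_x \sqrt{2\pi}}\, e^{-x^2/2\sigma_x^2}$

Similarly, the marginal distribution of y is

$h(y) = \int_{-\infty}^{\infty} f(x, y)\, dx = \dfrac{1}{\sigma_y \sqrt{2\pi}}\, e^{-y^2/2\sigma_y^2}$

Hence we may state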




Theorem 8.1. If two variables are normally correlated, each variable is
normally distributed in its marginal totals.

That the converse is not necessarily true is shown by the following illustration.
Consider a clay model of a normal correlation surface such that its
marginal totals are necessarily normal distributions by the above theorem.
Quantities of the clay can be redistributed by piling up in certain spots the clay
that is scooped out in other spots in such a way that the marginal totals are
not disturbed. It is obvious that the resulting surface is not one that is
defined by (8.13).

Other interesting properties of normally correlated variables are described
by the following theorems.

Theorem 8.2. The regression systems of a normal correlation surface are
linear.


The proof is a matter of integration. Let us find the probability function
of an x array of y's. By definition, this is given by f(x, y)/g(x). To get the
mean of such an array we must multiply its probability distribution by y
and integrate over all values of y in the array. Thus we have

$\eta_x = \dfrac{\int_{-\infty}^{\infty} y f(x, y)\, dy}{g(x)} = \dfrac{1}{\sigma_y [2\pi(1 - \rho^2)]^{1/2}} \int_{-\infty}^{\infty} y \exp\left\{ -\dfrac{(y - \rho \sigma_y x/\sigma_x)^2}{2\sigma_y^2(1 - \rho^2)} \right\} dy = \dfrac{x \rho \sigma_y}{\sigma_x}$

If x is allowed to vary over the arrays, it is evident that the locus of the means
of the x arrays of y's is the line

(8.15)    $\eta_x = \dfrac{x \rho \sigma_y}{\sigma_x}$

In a similar way the mean of a y array of x's is given by

$\xi_y = \dfrac{\int_{-\infty}^{\infty} x f(x, y)\, dx}{h(y)} = \dfrac{y \sigma_x \rho}{\sigma_y}$

and this lies on the regression line

(8.16)    $\xi_y = \dfrac{y \rho \sigma_x}{\sigma_y}$

While it is an intrinsic property of a normal correlation surface that both
regressions are linear, one should not infer that this is characteristic of joint
probability functions in general. One or both or neither of the regression
systems of a joint probability function may be linear. The student
will observe that the definition of the correlation coefficient did not involve
the condition that f(x, y) was normal nor that regression was linear.
Although the definition of a correlation coefficient does not require linear
regression, nevertheless the correlation coefficient may not be a good measure of
relationship if the regression is definitely non-linear.

Theorem 8.3. If x and y are normally correlated, then each array is a normal
distribution, each x-array of y's has the same variance $\sigma_{ey}^2$ and each y-array of x's
has the same variance $\sigma_{ex}^2$.

The proof consists in exhibiting the frequency function for an x array of y's
and for a y array of x's. Thus, for the first case we have

$\dfrac{f(x, y)}{g(x)} = \dfrac{1}{\sqrt{2\pi}\, \sigma_{ey}} \exp\left[ -\dfrac{(y - \rho x \sigma_y/\sigma_x)^2}{2\sigma_{ey}^2} \right]$

where $\sigma_{ey}^2 = \sigma_y^2(1 - \rho^2)$. Evidently, this is a normal distribution with
variance $\sigma_{ey}^2$, which is independent of x and therefore is constant over all x
arrays. It is left as an exercise for the student to give the companion proof
for the arrays in the y direction.

When the variance is constant over the arrays in the x direction the regression
system of y on x is said to be homoscedastic (equally scattered). Similarly
for the y direction. A geometrical representation of a normal correlation surface
is given in Part I, § 18 of Chapter VIII.

8.4 Limiting Forms. Suppose a plane is passed through the surface defined by (8.13) parallel to the xy-plane. Analytically, this means that we let f(x, y) = c where c is some constant less than the maximum value of the function, that is, we take 0 < c < K to insure a real intersection. We obtain

(8.17)    $\dfrac{x^2}{\sigma_x^2} - \dfrac{2\rho xy}{\sigma_x\sigma_y} + \dfrac{y^2}{\sigma_y^2} = \lambda^2$

where

(8.18)    $\lambda^2 = 2(1 - \rho^2)\log_e (K/c)$

which is obviously not negative. Thus the points (x, y) for which the probability density is constant lie on an ellipse.

It is easier to study (8.17) if we transform the variables to standard units by letting $t_x = x/\sigma_x$ and $t_y = y/\sigma_y$. Then (8.17) becomes

(8.19)    $t_x^2 - 2\rho t_x t_y + t_y^2 = \lambda^2$

The cross-product term will vanish under the transformations

$$t_x = u\cos\theta - v\sin\theta, \qquad t_y = u\sin\theta + v\cos\theta$$

when $\theta = \pi/4$. So the required rotation formulas are

(8.20)    $t_x = \dfrac{u - v}{2^{1/2}}, \qquad t_y = \dfrac{u + v}{2^{1/2}}$

Applying these to (8.19) we obtain

(8.21)    $u^2(1 - \rho) + v^2(1 + \rho) = \lambda^2$

which may be written in the standard form of an ellipse

(8.22)    $\dfrac{u^2}{a^2} + \dfrac{v^2}{b^2} = 1$

with semi-axes a and b, where

$$a^2 = \frac{\lambda^2}{1 - \rho} \qquad \text{and} \qquad b^2 = \frac{\lambda^2}{1 + \rho}$$

The eccentricity of the ellipse (8.22) is $(1 - b^2/a^2)^{1/2} = [2\rho/(1 + \rho)]^{1/2}$.

We see that b → a as ρ → 0. When ρ = 0, b = a = λ. Then (8.22) would be a circle, and (8.13) would be a surface of revolution if the variables were expressed in standard units. When ρ = 1, it follows from (8.21) and (8.18) that v = 0. From (8.20) it is seen that the line v = 0 is the same as $t_y = t_x$, and the ellipse has degenerated into a straight line. The surface then shrinks into a normal curve in the plane $t_y = t_x$.
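The limiting behaviour just described can be checked numerically. The following is only a minimal sketch, in which the function name `contour_ellipse` and the trial values of ρ and λ are illustrative assumptions and not part of the text. It evaluates the semi-axes of (8.22) and compares the eccentricity with the closed form $[2\rho/(1 + \rho)]^{1/2}$.

```python
import math


def contour_ellipse(rho, lam):
    """Semi-axes (a, b) and eccentricity of the contour ellipse (8.22).

    rho is the correlation coefficient, lam the quantity lambda of (8.18).
    The name and signature are illustrative assumptions.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    a = lam / math.sqrt(1.0 - rho)  # semi-axis along u
    b = lam / math.sqrt(1.0 + rho)  # semi-axis along v
    big, small = max(a, b), min(a, b)
    ecc = math.sqrt(1.0 - (small / big) ** 2)
    return a, b, ecc


a, b, ecc = contour_ellipse(rho=0.6, lam=1.0)
closed_form = math.sqrt(2 * 0.6 / (1 + 0.6))
print(a, b, ecc, closed_form)

# As rho tends to zero the ellipse tends to a circle of radius lambda.
print(contour_ellipse(rho=0.0, lam=2.0))
```

For negative ρ the roles of a and b are interchanged, which is why the sketch orders the semi-axes before computing the eccentricity.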

8.5 Tetrachoric Correlation. The word tetrachoric refers to a 2 × 2 fold table. Suppose N objects are classified according as they possess one or both or neither of two qualitative traits or attributes which may, for convenience, be denoted by I and II. Such a classification will yield a four-fold table as shown in Table 12,

Table 12

              Not II      II        Total
    Not I     a           b         a + b
    I         c           d         c + d
    Total     a + c       b + d     N


where a + b + c + d = N, the four classes being mutually exclusive but not necessarily exhaustive. The attributes may sometimes admit also of quantitative measurement, but we are considering only the case where they are classified dichotomously (that is, in two classes), such as "tall" and "not tall," "male" and "female," "alive" and "dead," "good" and "bad," "dull" and "not dull," etc. An example is the following classification of 26,287 children where attribute I is dullness and attribute II is developmental defects.

TABLE 13. (K. Pearson, Tables, p. li)

                 Without Defects    With Defects    Totals
    Not Dull         22,793             1,420       24,213
    Dull              1,186               888        2,074
    Totals           23,979             2,308       26,287

The problem in such classifications is to measure the intensity of association between the two attributes in the set. Let us suppose that our data had been given initially so that a fine division into many cells was possible and that the result would have presented a normal correlation surface. If this surface were then divided into four cells by planes x = h and y = k to yield the relative frequencies observed, then the correlation coefficient that characterizes this normal correlation surface is called tetrachoric r. It will be denoted by $r_t$. It is the correlation coefficient for the normal surface that reproduces the data.

Karl Pearson and Alice Lee have given tables for determining $r_t$. (See Tables for Statisticians and Biometricians, Part I, Table XXX. Also fuller tables in Part II, Tables VIII and IX.)

Suppose the 2 × 2 table is arranged (as it always can be) so that a + c > b + d and a + b > c + d. We now find h and k so that

$$\frac{b + d}{N} = \int_h^{\infty} \phi(t)\,dt, \qquad \frac{c + d}{N} = \int_k^{\infty} \phi(t)\,dt$$

Fig. 23

where $\phi(t)$ is the standard normal frequency function, h and k being, therefore, positive numbers (Figure 23). We then calculate d/N and interpolate in the tables to find $r_t$ corresponding to the given h, k, and d/N. The procedure involves a double interpolation to find d/N for each of two near values of $r_t$ and a final interpolation between these two values. Thus for Table 13 we find h = 1.354, k = 1.413, d/N = 0.03378. For $r_t$ = 0.65, d/N = 0.0337 approximately; for $r_t$ = 0.70, d/N = 0.0371 approximately; so that finally $r_t$ = 0.650.

An approximate simple method of finding $r_t$, useful when $|r_t| < 0.8$, has been given by Camp. This avoids most of the labor in interpolation.
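Where tables are not at hand, the same quantity can be approximated by direct numerical work. The sketch below is an assumption-laden illustration, not the tabular method of Pearson and Lee nor Camp's approximation. It finds h and k from the marginal proportions, evaluates the upper-quadrant probability of a standard bivariate normal surface by Simpson's rule, and locates $r_t$ by bisection. The function names, the integration limit and the step counts are all hypothetical choices.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def upper_quadrant_prob(h, k, r, steps=1000, upper=10.0):
    """P(X > h, Y > k) for a standard bivariate normal with correlation r.

    Uses Simpson's rule on
        integral_h^upper phi(x) * [1 - Phi((k - r*x) / sqrt(1 - r^2))] dx,
    where `upper` stands in for infinity (an assumption of this sketch).
    """
    s = math.sqrt(1.0 - r * r)

    def g(x):
        return STD.pdf(x) * (1.0 - STD.cdf((k - r * x) / s))

    width = (upper - h) / steps
    total = g(h) + g(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(h + i * width)
    return total * width / 3.0


def tetrachoric_r(a, b, c, d, tol=1e-6):
    """Return (h, k, r_t) for a four-fold table laid out as in Table 12."""
    n = a + b + c + d
    h = STD.inv_cdf((a + c) / n)  # area below h is (a + c)/N
    k = STD.inv_cdf((a + b) / n)  # area below k is (a + b)/N
    target = d / n
    lo, hi = -0.999, 0.999
    # The quadrant probability increases with r, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if upper_quadrant_prob(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return h, k, 0.5 * (lo + hi)


# Table 13: a = 22,793, b = 1,420, c = 1,186, d = 888
h, k, r_t = tetrachoric_r(22793, 1420, 1186, 888)
print(round(h, 3), round(k, 3), round(r_t, 3))
```

The result should be close to the value obtained above by interpolation, small differences being attributable to rounding in the tables and to the numerical integration.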

8.6 Linear Regression as Estimated from a Sample. Let $(x_i, y_i)$ be corresponding sets of values of x and y, where i = 1, 2, ..., N. We will assume that the $x_i$ are fixed numbers, whereas the $y_i$ are random variables. We assume also that the true regression is linear, given by equation (8.3), and that the $y_i$ are independently and normally distributed about the regression values with variance $\sigma_{ey}^2$, the same for all values of x. Hence, if

(8.23)    $\eta_i = \alpha + \beta x_i$

we have

(8.24)    $y_i = \eta_i + \epsilon_i$

and the $\epsilon_i$ are independent normal variates with expectation zero and variance $\sigma_{ey}^2$.

The method of least squares determines a straight line

(8.25)    $Y = a + bx$

such that the sum of squares of the distances of the points $(x_i, y_i)$ from this line, measured perpendicular to the x-axis, is a minimum. The a and b so determined are estimates of the α and β of the true regression line.

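Writing

$$S = \sum (y_i - a - bx_i)^2$$

and putting

$$\frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b} = 0$$

we have as the conditions for a minimum of S

(8.26)    $\sum (y_i - a - bx_i) = 0, \qquad \sum x_i(y_i - a - bx_i) = 0$

or

(8.27)    $Na + b\sum x_i = \sum y_i, \qquad a\sum x_i + b\sum x_i^2 = \sum x_i y_i$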

From these two equations (known as normal equations), a and b may be determined. We find, on writing

$$\sum x_i = N\bar{x}, \qquad \sum y_i = N\bar{y}$$

$$\sum x_i^2 = N(s_x^2 + \bar{x}^2), \qquad \sum y_i^2 = N(s_y^2 + \bar{y}^2)$$

$$\sum x_i y_i = N\,\overline{xy}$$

that

(8.28)    $b = (\overline{xy} - \bar{x}\bar{y})/s_x^2 = rs_y/s_x$

where r is Pearson's coefficient of correlation for the sample, and that

(8.29)    $a = \bar{y} - b\bar{x}$

The equation of the "best-fitting" straight line, in the sense described before, is, therefore,

(8.30)    $Y - \bar{y} = b(x - \bar{x})$

and so passes through the point $(\bar{x}, \bar{y})$. This line is called the sample regression line of y on x, or the trend line of y on x. It is used to estimate y for a given value of x.

The minimum value of S is given by

(8.31)    $S_{\min} = \sum [(y_i - \bar{y}) - b(x_i - \bar{x})]^2$

$\qquad\qquad = N(s_y^2 + b^2 s_x^2 - 2rbs_x s_y)$

$\qquad\qquad = Ns_y^2(1 - r^2)$

so that $S_{\min}/N$ is an estimate of $\sigma_{ey}^2$ as given by (8.12). It is not, however, unbiased. We shall now prove that an unbiased estimate is provided by $S_{\min}/(N - 2)$.
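The computation of a and b from the normal equations, and the identity (8.31), may be illustrated on a small set of numbers. The sketch below is only an illustration: the data are invented for the purpose and the function name `fit_line` is an assumption.

```python
import math


def fit_line(xs, ys):
    """Least-squares line Y = a + b*x, following (8.28) and (8.29).

    Returns (a, b, r, s_y^2, S_min), where variances use the divisor N
    as in the text.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs) / n  # s_x^2
    syy = sum((y - ybar) ** 2 for y in ys) / n  # s_y^2
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    b = sxy / sxx  # (8.28)
    a = ybar - b * xbar  # (8.29)
    r = sxy / math.sqrt(sxx * syy)
    s_min = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, r, syy, s_min


# Invented illustrative data (not from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, r, syy, s_min = fit_line(xs, ys)

# Both normal equations (8.26) should be satisfied up to rounding error.
eq1 = sum(y - a - b * x for x, y in zip(xs, ys))
eq2 = sum(x * (y - a - b * x) for x, y in zip(xs, ys))
print(a, b, r)
print(s_min, len(xs) * syy * (1 - r * r))  # the two sides of (8.31)
```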

8.7 The Sample Standard Error of Estimate. From (8.23), (8.24), and (8.25),

(8.32)    $y_i - Y_i = \epsilon_i - (a - \alpha) - (b - \beta)x_i$

Also, from (8.29), (8.23) and (8.24),

$$a + b\bar{x} = \bar{y} = \alpha + \beta\bar{x} + \bar{\epsilon}$$

where $\bar{\epsilon}$ is the sample mean of the $\epsilon_i$, so that

(8.33)    $a - \alpha + (b - \beta)\bar{x} = \bar{\epsilon} = \dfrac{1}{N}\sum \epsilon_i$

From (8.28),

$$b = \sum (y_i - \bar{y})(x_i - \bar{x})/(Ns_x^2)$$

$$= \sum y_i(x_i - \bar{x})/(Ns_x^2)$$

$$= \sum (x_i - \bar{x})(\alpha + \beta x_i + \epsilon_i)/(Ns_x^2)$$

$$= \sum (x_i - \bar{x})[\alpha + \beta(x_i - \bar{x}) + \beta\bar{x} + \epsilon_i]/(Ns_x^2)$$

(8.34)    $= \beta + \sum \epsilon_i(x_i - \bar{x})/(Ns_x^2)$

since $\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x})^2 = Ns_x^2$. Hence b is normally distributed about β with variance $\sigma_{ey}^2/(Ns_x^2)$. It follows similarly from (8.33) and (8.34) that a is normally distributed about α with variance $\sigma_{ey}^2(s_x^2 + \bar{x}^2)/(Ns_x^2)$. We have, therefore,

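$$y_i - Y_i = \epsilon_i - \bar{\epsilon} - (b - \beta)(x_i - \bar{x})$$

$$= \epsilon_i - \bar{\epsilon} - (x_i - \bar{x})\sum_j \epsilon_j(x_j - \bar{x})/(Ns_x^2)$$

(8.35)    $= \epsilon_i - \bar{\epsilon} - \epsilon_b(x_i - \bar{x})$

where

$$\epsilon_b = \sum \epsilon_i(x_i - \bar{x})/(Ns_x^2) = b - \beta$$

and so

$$\sum (y_i - Y_i)^2 = \sum (\epsilon_i - \bar{\epsilon})^2 + N\epsilon_b^2 s_x^2 - 2\epsilon_b \sum \epsilon_i(x_i - \bar{x})$$

$$= \sum \epsilon_i^2 - N\bar{\epsilon}^2 + N\epsilon_b^2 s_x^2 - 2N\epsilon_b^2 s_x^2$$

$$= \sum \epsilon_i^2 - N\bar{\epsilon}^2 - N\epsilon_b^2 s_x^2$$

Now $\sum \epsilon_i^2/\sigma_{ey}^2$ is a sum of N squares of normal standard variates, and $N^{1/2}\bar{\epsilon}/\sigma_{ey} = \sum \epsilon_i/(N^{1/2}\sigma_{ey})$ is a linear function of these variates. Also $N^{1/2}s_x\epsilon_b/\sigma_{ey} = \sum \epsilon_i(x_i - \bar{x})/(N^{1/2}s_x\sigma_{ey})$ is an independent linear function of the same variates since it is easily verified that these two functions are orthogonal. Hence by Theorem 5.10 $\sum (y_i - Y_i)^2/\sigma_{ey}^2$ is independently distributed as $\chi^2$ with N − 2 degrees of freedom.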

Since the expectation of $\chi^2$ is equal to the number of degrees of freedom, it follows that

$$E\{\sum (y_i - Y_i)^2\} = (N - 2)\sigma_{ey}^2$$

Hence $\sum (y_i - Y_i)^2/(N - 2)$ is an unbiased estimate of $\sigma_{ey}^2$. Denoting this estimate by $\hat{\sigma}_{ey}^2$, we have from (8.31)

(8.36)    $\hat{\sigma}_{ey}^2 = Ns_y^2(1 - r^2)/(N - 2)$

$\hat{\sigma}_{ey}$ is an estimate, from the sample, of the quantity called in § 8.2 the "standard error of estimate." The loss of two degrees of freedom is suggested by the fact that two constants a and b of the regression line have been calculated from the sample.

When a and b have been determined from the normal equations, the value of $\sum (y_i - Y_i)^2$ may be estimated from the following equation:

(8.37)    $\sum (y_i - Y_i)^2 = \sum (y_i - a - bx_i)^2$

$\qquad\qquad = \sum y_i(y_i - a - bx_i)$    by (8.26)

$\qquad\qquad = \sum y_i^2 - a\sum y_i - b\sum x_i y_i$

8.8 Confidence Limits for the Constants of the Regression Line. The variance of b is $\sigma_{ey}^2/(Ns_x^2)$, so that an independent and unbiased estimate of the variance is, from (8.36), given by

(8.38)    $\mathrm{Var}(b) = \dfrac{s_y^2}{s_x^2}\cdot\dfrac{1 - r^2}{N - 2} = \dfrac{b^2}{r^2}\cdot\dfrac{1 - r^2}{N - 2}$

Hence the ratio

(8.39)    $t = \dfrac{r(b - \beta)}{b}\left[\dfrac{N - 2}{1 - r^2}\right]^{1/2}$

has the Student distribution with N − 2 degrees of freedom. For the important case β = 0 (that is, ρ = 0), $t = r[(N - 2)/(1 - r^2)]^{1/2}$, and thus is independent of the population parameters.

By choosing suitable values of t in (8.39), limits of β corresponding to certain assigned degrees of confidence may be calculated.

The variance of a is $\sigma_{ey}^2(s_x^2 + \bar{x}^2)/(Ns_x^2)$, which is estimated by

$$(s_x^2 + \bar{x}^2)\,\frac{b^2(1 - r^2)}{(N - 2)r^2}$$

Therefore

$$\frac{(a - \alpha)r}{b}\left[\frac{N - 2}{(1 - r^2)(s_x^2 + \bar{x}^2)}\right]^{1/2}$$

has also the t distribution with N − 2 degrees of freedom, and this fact may be used in calculating confidence limits for α.


Example 2. For 29 families living in Edmonton, Alberta, x represents the dollar income per adult unit per week in 1936 and y the calories in food purchased per adult unit per week (the size of family was expressed in terms of equivalent "adult units"). From the data obtained, $\sum x = 201.9$, $\sum x^2 = 1658.6$, $\sum y = 6.7060 \times 10^5$, $\sum y^2 = 1.6357 \times 10^{10}$ and $\sum xy = 4.9861 \times 10^6$. The calculated regression line is Y = a + bx, where a and b, from equations (8.27), are 14,387 and 1255 respectively. $\sum (y_i - Y_i)^2$, from (8.37), is $4.517 \times 10^8$, so that $\hat{\sigma}_{ey} = 4090$. Also $Ns_x^2 = 252.9$, so that the standard error of b is $4090/(252.9)^{1/2} = 257.2$. Taking t = 2.052, we have, as the 95% confidence limits for β,

    b ± 2.052 × 257.2 = 1255 ± 528

Hence the slope of the true regression line may be taken with 95% confidence as lying between 727 and 1783 calories per dollar. Since $N(s_x^2 + \bar{x}^2) = \sum x^2 = 1658.6$, the standard error of a is $4090[1658.6/(29 \times 252.9)]^{1/2} = 1945$, so that α = 14,387 ± 3991. The true value α may, therefore, be regarded, with 95% confidence, as lying between about 10,400 and 18,380.
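The arithmetic of Example 2 can be retraced from the five sums alone. The following sketch is offered only as a check under stated assumptions: the variable names are hypothetical, the sums are taken as quoted (to five significant figures), and t = 2.052 is the tabular value used in the example rather than one computed here. Because the constants are carried unrounded, the results agree with the example only to within rounding.

```python
import math

# Sums quoted in Example 2 (N = 29 families).
N = 29
sum_x, sum_x2 = 201.9, 1658.6
sum_y, sum_y2 = 6.7060e5, 1.6357e10
sum_xy = 4.9861e6
t_95 = 2.052  # tabulated Student t, 27 degrees of freedom

n_sx2 = sum_x2 - sum_x ** 2 / N                # N * s_x^2
b = (sum_xy - sum_x * sum_y / N) / n_sx2       # slope, from (8.27)
a = (sum_y - b * sum_x) / N                    # intercept, (8.29)
resid_ss = sum_y2 - a * sum_y - b * sum_xy     # (8.37)
sigma_hat = math.sqrt(resid_ss / (N - 2))      # square root of (8.36)

se_b = sigma_hat / math.sqrt(n_sx2)
se_a = sigma_hat * math.sqrt(sum_x2 / (N * n_sx2))

print("b =", round(b, 1), "+/-", round(t_95 * se_b, 1))
print("a =", round(a, 1), "+/-", round(t_95 * se_a, 1))
```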


The assumption made in Ex. 2 that the true variance of y is the same for all x is probably not realistic. One might well expect the variance to increase with increasing x. The alternative assumption that $\sigma_{y|x}/y$ is constant would mean calculating the regression of log y instead of y on x (since $\delta \log y \approx \delta y/y$). (See § 9.6.)

It may be observed that if the values of x are at our disposal, as is sometimes the case in planned experiments, they may be chosen so as to minimize the variance of b. This variance varies inversely as $s_x^2$. If, for example, we have 2N observations covering the range from x = −a to x = +a at equally spaced intervals, the value of $s_x^2$ is $[(2N + 1)/(2N - 1)](a^2/3)$, but if we make N observations at −a, N at +a, and none in between, $s_x^2 = a^2$. The ratio of the variance of b in the second case to that in the first is (2N + 1)/(6N − 3), which for N > 1 is always less than 1, and for large N approaches 1/3. It is assumed, of course, that the regression is definitely known to be linear. If we want to test the linearity, we must space out our observations.
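The ratio (2N + 1)/(6N − 3) is easily confirmed by direct computation of $s_x^2$ for the two arrangements. The sketch below is a small illustration with hypothetical function names; since the variance of b is inversely proportional to $s_x^2$, the ratio of variances is the inverse ratio of the two values of $s_x^2$.

```python
def var_x(xs):
    """s_x^2 with divisor equal to the number of observations."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n


def designs(N, a):
    """Two arrangements of 2N observations on the range [-a, a]."""
    count = 2 * N
    spaced = [-a + 2 * a * j / (count - 1) for j in range(count)]
    ends = [-a] * N + [a] * N
    return spaced, ends


N, a = 5, 1.0
spaced, ends = designs(N, a)

# Var(b) at the end-points divided by Var(b) for equal spacing.
ratio = var_x(spaced) / var_x(ends)
print(ratio, (2 * N + 1) / (6 * N - 3))
```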

8.9 Confidence Limits for the True Regression and for an Estimated y Corresponding to any Given x. From (8.24) and (8.36),

(8.40)    $Y - \eta = \bar{\epsilon} + \epsilon_b(x - \bar{x})$

$\qquad\qquad = \dfrac{1}{N}\sum \epsilon_i[1 + (x_i - \bar{x})(x - \bar{x})/s_x^2]$

Hence Y is normally distributed about η with variance

$$\frac{\sigma_{ey}^2}{N}\left[1 + (x - \bar{x})^2/s_x^2\right]$$

Replacing $\sigma_{ey}$ by its estimate $\hat{\sigma}_{ey}$, we have

(8.41)    $\text{s.e.}(Y) = \hat{\sigma}_{ey}N^{-1/2}[1 + (x - \bar{x})^2/s_x^2]^{1/2}$

This is the real standard error of estimate.

Since Y − η divided by this standard error has the t-distribution with N − 2 degrees of freedom, confidence limits for η may be calculated in the usual way. It should be observed that the standard error of Y increases as $(x - \bar{x})^2$ increases, so that the curves bounding the confidence intervals for different values of x are hyperbolas. (See Figure 24.)

If, however, we are interested not so much in Y itself as in the difference between an actual y corresponding to a given x and the predicted Y for the same x, we must take into account also the variation of y. Since y is normally distributed about η with variance $\sigma_{ey}^2$, independently of Y, we have

(8.42)    $\mathrm{Var}(Y - y) = \mathrm{Var}\,Y + \mathrm{Var}\,y = \sigma_{ey}^2\left[1 + \dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{Ns_x^2}\right]$

Since the expectation of Y − y is zero, it follows that the ratio of Y − y to $\hat{\sigma}_{ey}[1 + N^{-1} + (x - \bar{x})^2/(Ns_x^2)]^{1/2}$ is distributed as Student's t with N − 2 degrees of freedom. Hence confidence limits for y may be determined. They are given by

$$a + bx \pm t\,\hat{\sigma}_{ey}[1 + N^{-1} + (x - \bar{x})^2/(Ns_x^2)]^{1/2}$$

Example 3. For the data of Example 2, $\hat{\sigma}_{ey} = 4090$. At $x = \bar{x} = 6.962$, the standard error of Y is 760, while at x = 15 it is 2203. The 95% confidence limits for y at $x = \bar{x}$ are 23,124 ± 8536 calories while at x = 15 they are 33,211 ± 9534 calories.
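The figures of Example 3 follow from (8.41) and (8.42) by direct substitution. A minimal sketch is given below, using the rounded constants of Example 2 as inputs; the function names are illustrative assumptions, and the results agree with the example only to within rounding.

```python
import math

# Rounded constants carried over from Example 2.
N, xbar = 29, 6.962
sigma_hat, n_sx2 = 4090.0, 252.9   # estimate of sigma_ey and N * s_x^2
a, b, t_95 = 14387.0, 1255.0, 2.052


def se_regression_value(x):
    """Standard error of Y at a given x, equation (8.41)."""
    return sigma_hat * math.sqrt(1.0 / N + (x - xbar) ** 2 / n_sx2)


def prediction_limits(x):
    """Centre and half-width of the 95% limits for an actual y, from (8.42)."""
    half = t_95 * sigma_hat * math.sqrt(
        1.0 + 1.0 / N + (x - xbar) ** 2 / n_sx2
    )
    return a + b * x, half


for x in (xbar, 15.0):
    centre, half = prediction_limits(x)
    print(x, round(se_regression_value(x)), round(centre), round(half))
```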

8.10 Confidence Limits for x, Given y, When the Regression Is Calculated for Fixed x. It sometimes happens that we desire to estimate x for a given y, even though we have calculated the regression for fixed values of x. We may, for example, wish to estimate the median lethal dose of a drug (that is, the dose that will kill 50% of the time) from observations on the proportions killed with various known doses.

Writing $y' = y - \bar{y}$, $x' = x - \bar{x}$, $\lambda = 1 + 1/N$, we know from § 8.9 that

(8.43)    $(y' - bx')\big/\left[\hat{\sigma}_{ey}(\lambda + x'^2/(Ns_x^2))^{1/2}\right]$

has the Student t-distribution with N − 2 degrees of freedom. This equation may be regarded as a quadratic in x', namely,

$$x'^2(1 - B^2) - 2x'y'/b + y'^2/b^2 - \lambda Ns_x^2 B^2 = 0$$

where $B^2 = t^2\hat{\sigma}_{ey}^2/(Nb^2s_x^2)$.

The estimated x' for a given y' is y'/b, so that, on denoting this estimate by X', we have

(8.44)    $f(x') = x'^2(1 - B^2) - 2X'x' + X'^2 - \lambda Ns_x^2 B^2 = 0$

For a given value of t, say $t_\alpha$, the roots of this equation supply confidence limits for x'. Since, for values of x' between the α% confidence limits, |t| is less than $t_\alpha$, it follows that $|y' - bx'| < t_\alpha\hat{\sigma}_{ey}(\lambda + x'^2/(Ns_x^2))^{1/2}$ and hence that f(x') < 0 when x' lies between the confidence limits.

Fig. 25. (a) $B^2 < 1$, (b) $B^2 > 1$. The Confidence Intervals Are Indicated by Double Lines.

The practically important case is that when $B^2 < 1$. In this case the two roots are real and the confidence interval lies between them. Note that $B^2 < 1$ implies that b is greater than $t_\alpha$ times the standard error of b, which means that b is different from zero at the level of significance represented by α.

If $B^2 > 1$, the roots of (8.44) may be either real or complex. If they are complex, no confidence limits exist, except −∞ to ∞, and if they are real, say $x_1'$ and $x_2'$ ($x_1' < x_2'$), the confidence limits are from −∞ to $x_1'$ and from $x_2'$ to +∞, since these are the regions in which f(x') < 0. The two cases are illustrated in Figure 25.
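The roots of (8.44) are readily obtained by the ordinary formula for a quadratic. The sketch below is illustrative only: the function name and return convention are assumptions, and the numerical inputs are the rounded constants of Example 2, applied here to the case y' = 0 (that is, $y = \bar{y}$) purely for demonstration.

```python
import math


def inverse_limits(y_dev, b, sigma_hat, n_sx2, N, t):
    """Roots of the quadratic (8.44) in x' = x - xbar.

    y_dev is y' = y - ybar and n_sx2 is N * s_x^2.
    Returns (B2, roots); roots is None when they are complex.  When
    B2 < 1 the confidence interval lies between the roots; when B2 > 1
    it lies outside them.  Naming is an assumption of this sketch.
    """
    lam = 1.0 + 1.0 / N
    B2 = (t * sigma_hat) ** 2 / (b * b * n_sx2)
    X = y_dev / b                    # point estimate of x'
    A = 1.0 - B2                     # coefficient of x'^2
    if A == 0.0:
        raise ValueError("B^2 = 1: the equation is not a quadratic")
    disc = X * X - A * (X * X - lam * n_sx2 * B2)
    if disc < 0.0:
        return B2, None
    root = math.sqrt(disc)
    r1, r2 = sorted(((X - root) / A, (X + root) / A))
    return B2, (r1, r2)


B2, (lo, hi) = inverse_limits(
    y_dev=0.0, b=1255.0, sigma_hat=4090.0, n_sx2=252.9, N=29, t=2.052
)
print(round(B2, 4), round(lo, 2), round(hi, 2))
```

Here $B^2 < 1$, so the interval between the two roots is the confidence interval for x', to be added to $\bar{x}$ to give limits for x itself.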

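8.11 Linear Regression with the x's and y's Both Subject to Error. An appropriate mathematical model for this case is the set of equations

(8.45)    $x_i = \xi_i + d_i$

(8.46)    $y_i = \eta_i + e_i$

(8.47)    $\eta_i = \alpha + \beta\xi_i$

where $d_i$ and $e_i$ are normally and independently distributed about zero with variances $\sigma_d^2$ and $\sigma_e^2$ respectively.

If the $\xi_i$ are considered fixed values, x and y have a joint bivariate normal distribution given by

$$f(x, y) = (2\pi\sigma_d\sigma_e)^{-1}\exp\left[-\frac{(x - \xi)^2}{2\sigma_d^2} - \frac{(y - \alpha - \beta\xi)^2}{2\sigma_e^2}\right]$$

For a sample of N pairs of observations, the likelihood (defined as the logarithm of the probability function for the actual sample) is given, apart from a constant, by

$$L = -N\log\sigma_d - N\log\sigma_e - \frac{1}{2\sigma_d^2}\sum (x_i - \xi_i)^2 - \frac{1}{2\sigma_e^2}\sum (y_i - \alpha - \beta\xi_i)^2$$

This expression contains N + 2 unknown parameters, namely, α and β and the N values of $\xi_i$. If L is to be a maximum relative to all these parameters, we must have $\partial L/\partial\alpha = 0$, $\partial L/\partial\beta = 0$, $\partial L/\partial\xi_i = 0$, i = 1, 2, ..., N. These equations give

$$\sum (y_i - \alpha - \beta\xi_i) = 0$$

(8.48)    $\sum \xi_i(y_i - \alpha - \beta\xi_i) = 0$

$$(x_i - \xi_i)/\sigma_d^2 + \beta(y_i - \alpha - \beta\xi_i)/\sigma_e^2 = 0, \qquad i = 1, 2, \ldots, N$$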

From the third equation of (8.48),

(8.49)    $\xi_i(1 + \beta^2\sigma_d^2/\sigma_e^2) = x_i + \beta(y_i - \alpha)\sigma_d^2/\sigma_e^2$

and by substituting (8.49) in the first and second equations of (8.48) we arrive after a little reduction at

(8.50)    $\sum (y_i - \alpha - \beta x_i) = 0, \qquad \sum [x_i + \lambda\beta(y_i - \alpha)](y_i - \alpha - \beta x_i) = 0$

where $\lambda = \sigma_d^2/\sigma_e^2$. If λ is known, α can be eliminated from the two equations of (8.50) and the resulting quadratic in β solved. This quadratic reduces to

(8.51)    $\lambda rs_xs_y\beta^2 + (s_x^2 - \lambda s_y^2)\beta - rs_xs_y = 0$

which has two real roots, one positive and one negative. If r is positive, the positive root is to be chosen.

If the value of λ is not known, we can add two further equations to (8.48), given by $\partial L/\partial\sigma_d = 0$ and $\partial L/\partial\sigma_e = 0$, but the solution of the complete set is not practicable.
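When λ is known, the root of (8.51) is given by the usual quadratic formula. The sketch below is a minimal illustration with invented data and a hypothetical function name. It assumes λ > 0 and a non-zero sample covariance, and it takes the root having the same sign as r; this extends the rule stated above for positive r to negative r as an assumption of the sketch.

```python
import math


def eiv_line(xs, ys, lam):
    """Slope and intercept from (8.51), with lam = sigma_d^2 / sigma_e^2 known.

    Solves  lam*sxy*beta^2 + (sxx - lam*syy)*beta - sxy = 0,
    where sxy = r*s_x*s_y, and returns (alpha, beta).
    """
    if lam <= 0.0:
        raise ValueError("lam must be positive in this sketch")
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs) / n
    syy = sum((y - ybar) ** 2 for y in ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    if sxy == 0.0:
        raise ValueError("zero covariance: slope is indeterminate")
    q = sxx - lam * syy
    disc = math.sqrt(q * q + 4.0 * lam * sxy * sxy)
    beta = (-q + disc) / (2.0 * lam * sxy)  # root with the sign of r
    alpha = ybar - beta * xbar              # from the first equation of (8.50)
    return alpha, beta


# Invented points lying exactly on y = 1 + 2x: any lam should recover the line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha1, beta1 = eiv_line(xs, ys, lam=1.0)
alpha2, beta2 = eiv_line(xs, ys, lam=0.25)
print(alpha1, beta1, alpha2, beta2)
```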

If the $\xi_i$ are not fixed, but are considered as random variables, which we may take to be normally distributed about zero with variance $\sigma_\xi^2$, independently of the $d_i$ and $e_i$, x and y are normally distributed and the joint distribution is bivariate normal, but they are not independent. The regression of y on x is, therefore, linear.

From (8.45), (8.46), and (8.47), we have

$$E(x_i) = 0, \qquad E(y_i) = \alpha$$

$$\mathrm{Var}(x_i) = \sigma_\xi^2 + \sigma_d^2$$

$$\mathrm{Var}(y_i) = \beta^2\sigma_\xi^2 + \sigma_e^2$$

$$\mathrm{Cov}(x_i, y_i) = E\{(\xi_i + d_i)(\alpha + \beta\xi_i + e_i)\} = \beta\sigma_\xi^2$$

Hence

$$\sigma_x^2 = \sigma_\xi^2 + \sigma_d^2, \qquad \sigma_y^2 = \beta^2\sigma_\xi^2 + \sigma_e^2$$

and

$$\rho\sigma_x\sigma_y = \beta\sigma_\xi^2$$

The expected value of the slope of the regression line is, therefore, $\rho\sigma_y/\sigma_x = \beta\sigma_\xi^2/(\sigma_\xi^2 + \sigma_d^2)$ and so is not in general equal to β. The squared standard error of estimate is $\sigma_y^2(1 - \rho^2) = \sigma_e^2 + \beta^2\sigma_d^2/(1 + \sigma_d^2/\sigma_\xi^2)$ and the actual variance of a prediction based on sample regression will agree with this to terms of order 1/n, n being the number of degrees of freedom.

It is of interest to observe that the prediction made from the sample regression of y on x is better than we could make if we knew the true values of α and β. For suppose our estimate for a given x is Y = α + βx. Then $Y - y = \beta(x - \xi) - e$, so that

$$E(Y - y) = 0$$

and

$$\mathrm{Var}(Y - y) = \beta^2\sigma_d^2 + \sigma_e^2$$

which is greater than the variance of estimate given above. Hence, at least when n is large, the best predicting equation to use is the ordinary regression of y on x.

8.12 The Distribution of r When ρ = 0. We have already seen in § 8.8 that when β = 0 (which means that ρ = 0) the quantity $t = r[(N - 2)/(1 - r^2)]^{1/2}$ has the t-distribution with N − 2 degrees of freedom. This enables us to determine whether a calculated r is significantly different from zero.

Example 4. For a sample of 27, r = 0.36. The value of t is 1.929 with n = 25, giving P = 0.07 approximately. The observed r is, therefore, not significantly different from zero.

The significance of r may be judged directly from tables (Fisher and Yates, Table VI, or Statistical Methods for Research Workers, Table V A). Thus for r = 0.36 and n = 25, we see at once from these tables that P lies between 0.05 and 0.1 and is, therefore, non-significant at the usual level.
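The value of t in Example 4 is a one-line computation. The short sketch below, with a hypothetical function name, merely evaluates $r[(N - 2)/(1 - r^2)]^{1/2}$; the probability P must still be taken from a table of Student's t, since no such table is computed here.

```python
import math


def t_from_r(r, N):
    """Student t with N - 2 degrees of freedom for testing rho = 0."""
    return r * math.sqrt((N - 2) / (1.0 - r * r))


t_value = t_from_r(0.36, 27)
print(round(t_value, 3))
```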


We proceed to find the frequency function for r when ρ = 0. We assume that we are dealing with a random sample of N pairs of independent observations, $x_i$, $y_i$, and for convenience we suppose that x and y are measured from their respective population means. Let the $y_i$ be subject to a linear orthogonal transformation (see § 4.13) yielding N variates $\eta_i$ of which we take $\eta_1$ as

$$\eta_1 = N^{1/2}\bar{y}$$

Then

$$\sum_{i=1}^{N} \eta_i^2 = \sum_{i=1}^{N} y_i^2 = \sum (y_i - \bar{y})^2 + N\bar{y}^2 = Ns_y^2 + \eta_1^2$$

Therefore

$$\sum_{i=2}^{N} \eta_i^2 = Ns_y^2$$

and since the $\eta_i$ are normal and independent variates the sum, divided by $\sigma_y^2$, is distributed like $\chi^2$ with N − 1 degrees of freedom.

Let us now take

$$\eta_2 = N^{1/2}rs_y = \sum (x_i - \bar{x})(y_i - \bar{y})/(N^{1/2}s_x) = \sum (x_i - \bar{x})y_i/(N^{1/2}s_x)$$

We can do this, since $\eta_2$ is orthogonal to $\eta_1$ and the sum of squares of coefficients of $y_i$ in $\eta_2$ is equal to 1. Then

$$\sum_{i=3}^{N} \eta_i^2 = Ns_y^2(1 - r^2)$$

so that $Ns_y^2(1 - r^2)/\sigma_y^2$ is distributed like $\chi^2$ with N − 2 df. Also $\eta_2^2/\sigma_y^2$, or $Nr^2s_y^2/\sigma_y^2$, is independently distributed as $\chi^2$ with 1 df.

Now $r^2 = \eta_2^2\big/\left(\eta_2^2 + \sum_{i=3}^{N}\eta_i^2\right) = \chi_1^2/(\chi_1^2 + \chi_{N-2}^2)$, and since $\tfrac{1}{2}\chi_1^2$ is a $\gamma(\tfrac{1}{2})$ variate and $\tfrac{1}{2}\chi_{N-2}^2$ is a $\gamma[(N - 2)/2]$ variate, it follows from Theorem 5.3 that $r^2$ is a $\beta[\tfrac{1}{2}, (N - 2)/2]$ variate.

The frequency function of $r^2$ is, therefore, given by

(8.52)    $f(r^2)\,d(r^2) = \dfrac{(r^2)^{-1/2}(1 - r^2)^{(N-4)/2}\,d(r^2)}{B(\tfrac{1}{2}, (N - 2)/2)}, \qquad 0 < r^2 < 1$

from which it follows that

(8.53)    $E(r^2) = (N - 1)^{-1}$

so that the standard error of r is $(N - 1)^{-1/2}$.

Since $d(r^2) = 2r\,dr$, and since $r^2$ goes from 1 to 0 and back while r goes from −1 to 1, the frequency function of r is given by

(8.54)    $f(r)\,dr = (1 - r^2)^{(N-4)/2}\,dr\big/B(\tfrac{1}{2}, (N - 2)/2), \qquad -1 < r < 1$
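As a check on (8.53) and (8.54), the frequency function may be integrated numerically. The following sketch is illustrative only: the helper names and the choice N = 10 are assumptions. It applies Simpson's rule to confirm that the total area is unity and that the second moment equals 1/(N − 1).

```python
import math


def beta_fn(p, q):
    """Complete Beta function B(p, q) via Gamma functions."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)


def null_density(r, N):
    """Frequency function of r when rho = 0, equation (8.54)."""
    return (1.0 - r * r) ** ((N - 4) / 2) / beta_fn(0.5, (N - 2) / 2)


def simpson(f, lo, hi, steps=2000):
    """Composite Simpson's rule; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


N = 10
area = simpson(lambda r: null_density(r, N), -1.0, 1.0)
second_moment = simpson(lambda r: r * r * null_density(r, N), -1.0, 1.0)
print(area, second_moment, 1.0 / (N - 1))
```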


This distribution belongs to Pearson Type II, a symmetrical bell-shaped curve if N > 5. The kurtosis $(\gamma_2) = -6/(N + 1)$ and so tends to zero as N → ∞. As N becomes large the function is practically normal and consequently

(8.55)    $t = r(N - 1)^{1/2}$

tends to be normally distributed with mean zero and unit standard deviation. Therefore, to test the significance of a value of r computed from a large sample (say 50 or more) it would not be invalid, to any appreciable extent, to refer (8.55) to a normal probability scale.

We may observe in passing that

(8.56)    $Ns_y^2 = N(1 - r^2)s_y^2 + Nr^2s_y^2$

that is,

$$\sum (y_i - \bar{y})^2 = \sum (y_i - Y_i)^2 + \sum (Y_i - \bar{y})^2$$

This means that the total sum of squares of deviations of y from the mean can be split up into a sum of squares of deviations from the regression line and a sum of squares of deviations of points on the regression line from the mean. These sums of squares, divided by $\sigma_y^2$, are (when ρ = 0) distributed as $\chi^2$ with N − 1, N − 2, and 1 degrees of freedom respectively. Hence the mean squares, given by dividing the sums of squares by the df, are unbiased estimates of $\sigma_y^2$. Moreover, the second and third of these mean squares are independent.

We could, therefore, apply the F test to the ratio $r^2(N - 2)/(1 - r^2)$, with $n_1 = 1$ and $n_2 = N - 2$, in order to determine whether or not the correlation is significantly different from zero.

8.13 Confidence Limits for the Variance Ratio of Two Correlated Variates. If x and y are measured from their respective population means, their joint normal bivariate distribution is given by

(8.57)    $f(x, y) = (2\pi\sigma_x\sigma_y)^{-1}(1 - \rho^2)^{-1/2}\exp[-P/2(1 - \rho^2)]$

where

$$P = x^2/\sigma_x^2 - 2\rho xy/(\sigma_x\sigma_y) + y^2/\sigma_y^2$$

It was pointed out by Pitman that if $u = x/\sigma_x + y/\sigma_y$ and $v = x/\sigma_x - y/\sigma_y$, then $P/[2(1 - \rho^2)]$ can be written as $\tfrac{1}{4}[u^2/(1 + \rho) + v^2/(1 - \rho)]$. Hence the joint probability function for u and v can be written as the product of a function of u and a function of v, so that u and v are independent normal variates with variances 2(1 + ρ) and 2(1 − ρ) respectively. If R is the observed correlation coefficient between u and v, $R[(N - 2)/(1 - R^2)]^{1/2}$ has the t-distribution with N − 2 df, or in other words, $R^2$ is a $\beta[\tfrac{1}{2}, (N - 2)/2]$ variate.

Now

$$R = \frac{\sum (u - \bar{u})(v - \bar{v})}{[\sum (u - \bar{u})^2 \sum (v - \bar{v})^2]^{1/2}}$$

and on substituting $u = x/\sigma_x + y/\sigma_y$, $v = x/\sigma_x - y/\sigma_y$, we obtain

(8.58)    $R = \dfrac{s_x^2/\sigma_x^2 - s_y^2/\sigma_y^2}{\left(s_x^2/\sigma_x^2 + s_y^2/\sigma_y^2 + 2rs_xs_y/\sigma_x\sigma_y\right)^{1/2}\left(s_x^2/\sigma_x^2 + s_y^2/\sigma_y^2 - 2rs_xs_y/\sigma_x\sigma_y\right)^{1/2}}$

where

$$Ns_x^2 = \sum (x - \bar{x})^2, \qquad Ns_y^2 = \sum (y - \bar{y})^2$$

and

$$Nrs_xs_y = \sum (x - \bar{x})(y - \bar{y})$$

If $w = s_x^2/s_y^2$, $\omega = \sigma_x^2/\sigma_y^2$, (8.58) becomes, on multiplying numerator and denominator by $\sigma_x^2/s_y^2$,

(8.59)    $R = (w - \omega)/[(w + \omega)^2 - 4r^2\omega w]^{1/2}$

whence confidence limits for ω can be found (see Problem 5).

8.14 The Distribution of r When ρ Is Not Zero. The exact distribution of r for samples from a correlated parent population was found by Fisher, using a geometrical method. The following analytical treatment is due to Sawkins. Writing $t = x/\sigma_x$, $u = y/\sigma_y$, the joint frequency distribution of t and u is, from (8.57),

(8.60)    $f(t, u) = (2\pi)^{-1}(1 - \rho^2)^{-1/2}\exp[-P/2(1 - \rho^2)]$

where

$$P = t^2 - 2\rho tu + u^2 = (u - \rho t)^2 + (1 - \rho^2)t^2$$

Hence, if $v = (u - \rho t)(1 - \rho^2)^{-1/2}$, the exponent in (8.60) can be written as $-v^2/2 - t^2/2$, and the joint distribution of t and v is given by

(8.61)    $f(t, v) = (2\pi)^{-1}e^{-t^2/2}e^{-v^2/2}$

t and v are therefore independent normal standard variates, and $\sum v_i^2$ is distributed as $\chi^2$ with N degrees of freedom.

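Let us now make an orthogonal linear transformation from the $v_i$ to a new set of variates $\xi_1, \xi_2, \ldots, \xi_N$, where we choose

$$\xi_1 = N^{1/2}\bar{v} = N^{1/2}(1 - \rho^2)^{-1/2}(\bar{u} - \rho\bar{t})$$

Then

$$\sum_{i=1}^{N} \xi_i^2 = \sum v_i^2 = (1 - \rho^2)^{-1}\left[\sum u_i^2 - 2\rho\sum u_it_i + \rho^2\sum t_i^2\right]$$

$$= (1 - \rho^2)^{-1}\left[\sum (u_i - \bar{u})^2 - 2\rho\sum (u_i - \bar{u})(t_i - \bar{t}) + \rho^2\sum (t_i - \bar{t})^2 + N(\bar{u}^2 - 2\rho\bar{u}\bar{t} + \rho^2\bar{t}^2)\right]$$

$$= (1 - \rho^2)^{-1}[S_2^2 - 2\rho rS_1S_2 + \rho^2S_1^2] + \xi_1^2$$

where

$$S_1^2 = \sum (t_i - \bar{t})^2, \qquad S_2^2 = \sum (u_i - \bar{u})^2$$

Hence

(8.62)    $\sum_{i=2}^{N} \xi_i^2 = (1 - \rho^2)^{-1}(S_2^2 - 2\rho rS_1S_2 + \rho^2S_1^2)$

and this is distributed as $\chi^2$ with N − 1 df.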

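Now let us choose $\xi_2$ as

$$\xi_2 = S_1^{-1}\sum (t_i - \bar{t})v_i$$

which is orthogonal to $\xi_1$ and is such that the sum of squares of coefficients is equal to unity. Then

$$\xi_2 = S_1^{-1}(1 - \rho^2)^{-1/2}\sum (t_i - \bar{t})(u_i - \rho t_i)$$

$$= S_1^{-1}(1 - \rho^2)^{-1/2}\sum (t_i - \bar{t})[u_i - \bar{u} - \rho(t_i - \bar{t})]$$

$$= (1 - \rho^2)^{-1/2}(rS_2 - \rho S_1)$$

so that

(8.63)    $\xi_2^2 = (1 - \rho^2)^{-1}(r^2S_2^2 - 2r\rho S_1S_2 + \rho^2S_1^2)$

From (8.62) and (8.63) we have

$$\sum_{i=3}^{N} \xi_i^2 = S_2^2(1 - r^2)(1 - \rho^2)^{-1}$$

and this is distributed as $\chi^2$ with N − 2 df.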

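Moreover $S_1^2 = \sum (t_i - \bar{t})^2$ is independently distributed as $\chi^2$ with N − 1 df. We have then three statistically independent variates, namely,

(8.64)    $a = \xi_2 = (1 - \rho^2)^{-1/2}(rS_2 - \rho S_1)$

$\qquad\qquad b = \tfrac{1}{2}\sum_{i=3}^{N} \xi_i^2 = \tfrac{1}{2}S_2^2(1 - r^2)(1 - \rho^2)^{-1}$

$\qquad\qquad c = \tfrac{1}{2}\sum (t_i - \bar{t})^2 = \tfrac{1}{2}S_1^2$

and these are respectively normal, $\gamma[(N - 2)/2]$ and $\gamma[(N - 1)/2]$ variates. Their joint frequency function is, therefore,

(8.65)    $f(a, b, c) = \dfrac{e^{-a^2/2}\,b^{(N-4)/2}e^{-b}\,c^{(N-3)/2}e^{-c}}{(2\pi)^{1/2}\,\Gamma[(N - 2)/2]\,\Gamma[(N - 1)/2]}$

We now change the variables to r, $S_1$, $S_2$. From (8.64) the Jacobian of the transformation is

$$\frac{\partial(a, b, c)}{\partial(r, S_1, S_2)} = (1 - \rho^2)^{-3/2}\begin{vmatrix} S_2 & -\rho & r \\ -rS_2^2 & 0 & S_2(1 - r^2) \\ 0 & S_1 & 0 \end{vmatrix} = -(1 - \rho^2)^{-3/2}S_1S_2^2$$

Also

$$\tfrac{1}{2}a^2 + b + c = \tfrac{1}{2}(1 - \rho^2)^{-1}[(rS_2 - \rho S_1)^2 + (1 - r^2)S_2^2 + (1 - \rho^2)S_1^2]$$

$$= \tfrac{1}{2}(1 - \rho^2)^{-1}(S_1^2 + S_2^2 - 2r\rho S_1S_2)$$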

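The joint frequency function of r, $S_1$, $S_2$ is, therefore, from (8.65),

(8.66)    $f(r, S_1, S_2) = \dfrac{(1 - r^2)^{(N-4)/2}S_1^{N-2}S_2^{N-2}\exp\left[-\dfrac{S_1^2 + S_2^2 - 2r\rho S_1S_2}{2(1 - \rho^2)}\right]}{(1 - \rho^2)^{(N-1)/2}(2\pi)^{1/2}\,2^{(2N-7)/2}\,\Gamma\left(\dfrac{N - 1}{2}\right)\Gamma\left(\dfrac{N - 2}{2}\right)}$

and the frequency function for r is given by integrating with respect to $S_1$ and $S_2$ from 0 to ∞.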

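This integration is not straightforward. Fisher found an ingenious transformation from $S_1$ and $S_2$ to new variables α and β such that

$$S_1 = \alpha^{1/2}e^{\beta/2}, \qquad S_2 = \alpha^{1/2}e^{-\beta/2}$$

The Jacobian $\partial(S_1, S_2)/\partial(\alpha, \beta) = -\tfrac{1}{2}$ and the part of (8.66) depending on $S_1$ and $S_2$ becomes

$$\alpha^{N-2}\exp[-\alpha(\cosh\beta - \rho r)/(1 - \rho^2)]$$

The limits of α are from 0 to ∞ and those of β from −∞ to ∞. The integration with respect to α gives

$$\Gamma(N - 1)(1 - \rho^2)^{N-1}/(\cosh\beta - \rho r)^{N-1}$$

and on noting that

$$2^{N-3}\,\Gamma\left(\frac{N - 1}{2}\right)\Gamma\left(\frac{N - 2}{2}\right) = \pi^{1/2}\,\Gamma(N - 2)$$

we obtain for the frequency function of r,

(8.67)    $f(r) = \pi^{-1}(N - 2)(1 - r^2)^{(N-4)/2}(1 - \rho^2)^{(N-1)/2}\displaystyle\int_0^{\infty}\frac{d\beta}{(\cosh\beta - \rho r)^{N-1}}$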

The integral can be expressed as a hypergeometric function

$$\left(\frac{\pi}{2}\right)^{1/2}\frac{\Gamma(N - 1)}{\Gamma(N - \tfrac{1}{2})}\,(1 - \rho r)^{-(N - 3/2)}\,F\left(\tfrac{1}{2}, \tfrac{1}{2}, N - \tfrac{1}{2}, \frac{1 + \rho r}{2}\right)$$

and we finally obtain f(r) as a rapidly convergent series,

(8.68)    $f(r) = \dfrac{(N - 2)\,\Gamma(N - 1)(1 - \rho^2)^{(N-1)/2}(1 - r^2)^{(N-4)/2}}{(2\pi)^{1/2}\,\Gamma(N - \tfrac{1}{2})(1 - \rho r)^{N - 3/2}}\left[1 + \dfrac{1}{4}\,\dfrac{\rho r + 1}{2N - 1} + \dfrac{9}{32}\,\dfrac{(\rho r + 1)^2}{(2N - 1)(2N + 1)} + \cdots\right]$

When ρ = 0, the integral in (8.67) is readily evaluated as $2^{N-3}B[(N - 1)/2, (N - 1)/2]$, and (8.67) then reduces to the simple form given in (8.54).

Tables of f(r) and $\int f(r)\,dr$ have been prepared by Miss David for the whole range of r and ρ and for all sample sizes from 2 up to 25. Values for a few larger sizes are also given, as well as charts from which the confidence limits for ρ can be determined for any given r and N.
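In the absence of tables, (8.67) can be evaluated by numerical integration over β. The sketch below is only an illustration under stated assumptions: the function names, the upper limit standing in for infinity and the step counts are arbitrary choices, and the check is restricted to N > 4 so that the factor $(1 - r^2)^{(N-4)/2}$ is finite at the ends of the range. It confirms that the result agrees with (8.54) when ρ = 0 and that the total area is unity for ρ = 0.5.

```python
import math


def simpson(f, lo, hi, steps):
    """Composite Simpson's rule; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def r_density(r, rho, N, beta_upper=12.0, steps=600):
    """Frequency function of r from (8.67), integrating numerically over beta.

    beta_upper replaces the infinite upper limit (an assumption that is
    adequate when N is not very small).
    """
    integral = simpson(
        lambda bt: (math.cosh(bt) - rho * r) ** (-(N - 1)),
        0.0, beta_upper, steps,
    )
    return ((N - 2) / math.pi
            * (1.0 - r * r) ** ((N - 4) / 2)
            * (1.0 - rho * rho) ** ((N - 1) / 2)
            * integral)


def null_density(r, N):
    """The simple form (8.54), valid when rho = 0."""
    b = math.gamma(0.5) * math.gamma((N - 2) / 2) / math.gamma((N - 1) / 2)
    return (1.0 - r * r) ** ((N - 4) / 2) / b


N = 10
at_zero = r_density(0.3, 0.0, N)
area = simpson(lambda r: r_density(r, 0.5, N), -1.0, 1.0, 200)
print(at_zero, null_density(0.3, N), area)
```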

The moments of the distribution of r can be expressed in the form of series. Thus

(8.69)    $E(r) = \rho - \dfrac{\rho(1 - \rho^2)}{2n} + \cdots$

$\qquad\qquad \mathrm{Var}(r) = \dfrac{(1 - \rho^2)^2}{n}\left[1 + \dfrac{11\rho^2}{2n} + \cdots\right]$

$\qquad\qquad \gamma_1 = -\dfrac{6\rho}{n^{1/2}}\left[1 + \dfrac{77\rho^2 - 30}{12n} + \cdots\right]$

$\qquad\qquad \gamma_2 = \dfrac{6}{n}(12\rho^2 - 1) + \cdots$

where n is written for N − 1. It is apparent that the distribution is far from normal unless n is quite large. If the samples are large (N > 400) and if ρ is small or only moderately large (|ρ| < .6 perhaps) then it is true that r is approximately normally distributed about the value ρ with standard deviation

$$\sigma_r = (1 - \rho^2)(N - 1)^{-1/2}$$

It is customary, under these conditions, to attach to an observed value of r a standard error

$$\sigma_r = (1 - r^2)(N - 1)^{-1/2}$$

and, for a proposed ρ, to refer the computed value of

$$\frac{r - \rho}{\sigma_r}$$

to a normal probability scale.

This procedure is invalid, however, if N is small and ρ is large. The distribution of r from small samples is skew and the skewness increases with ρ. This may be understood intuitively by considering the distribution of r's from a universe in which ρ is .9. The range of possible variation of r above ρ is only .1. But the possible range below ρ is 1.9. Accordingly the sampling distribution of r (N small) from this universe will be sharply skew, as is evident from (8.69). (An extensive cooperative study of the distribution of r by Soper et al. is now only of historical interest.)

The upper panel of Figure 26 (from Fisher's book) shows the r curves for two values of ρ with N = 8. They indicate the rapid departure from normality that may be expected from small samples as ρ approaches high values.

Fig. 26

It may be observed from (8.69) that r is a biased estimate of ρ. An approximately unbiased estimate is given by putting

$$r = E(r) = \rho - \rho(1 - \rho^2)/2n$$

and solving this equation for ρ. We obtain, to terms of order 1/n,

$$\hat{\rho} = r[1 + (1 - r^2)/2n]$$

This result is, however, different from that obtained by maximizing log f(r) for variations in ρ, which gives a kind of maximum likelihood estimate. It is easily proved that, to terms of order 1/n, $\hat{\rho} = r[1 - (1 - r^2)/2n]$. This estimate has minimum variance for large n among all nearly unbiased estimates (that is, estimates with a bias of order 1/n).

8.15 Fisher's z'-Transformation. In his study of the sampling distribution of the correlation coefficient Fisher found that it was not desirable to use r as the independent variable and he introduced a transformation which has distinctive merits. He showed that the quantity*

(8.70)    $z' = \tfrac{1}{2}\log_e\dfrac{1 + r}{1 - r} = \tanh^{-1} r$

is approximately normally distributed and is nearly constant in form as ρ changes. The lower panel of Figure 26 shows the distribution curves for z' corresponding to the r curves in the upper panel. The standard deviation is approximately

(8.71)    $(N - 3)^{-1/2}$

and is practically independent of ρ.

If $\zeta = \tanh^{-1}\rho$, it may be proved from the known distribution of r that

(8.72)    $E(z') = \zeta + \dfrac{\rho}{2n} + \cdots$

$\qquad\qquad \mathrm{Var}(z') = \dfrac{1}{n} + \dfrac{4 - \rho^2}{2n^2} + \cdots$

$\qquad\qquad \gamma_1 = \dfrac{\rho^3}{n^{3/2}} + \cdots$

$\qquad\qquad \gamma_2 = \dfrac{2}{n} + \dfrac{4 + 2\rho^2 - 3\rho^4}{n^2} + \cdots$

For moderate values of n, therefore, the skewness is much less for the z'-distribution than it is for the r-distribution. If we write the variance as

$$\frac{1}{n - k} = \frac{1}{n} + \frac{k}{n^2} + \cdots$$

and choose an integral value of k to give approximate agreement with the value in (8.72) we must clearly put k = 2. The approximate variance of z' is, therefore, $(n - 2)^{-1} = (N - 3)^{-1}$.

Fisher's transformation is applicable in the following tests (among others):

(a) To test if an observed value of r differs significantly from a proposed theoretical value.

(b) To test if two observed values are significantly different.

The procedure for (a) is to calculate

$$t = (z' - \zeta)(N - 3)^{1/2}$$

and refer the result to a normal probability scale. For (b) the procedure is to find, in accordance with (8.70), the two values of z', say $z_1'$ and $z_2'$, corresponding to the two observed values of r, say $r_1$ and $r_2$, from samples of $N_1$ and $N_2$, respectively. Then compute $d = z_1' - z_2'$ and $\sigma_d = \{1/(N_1 - 3) + 1/(N_2 - 3)\}^{1/2}$ and refer

$$\frac{d}{\sigma_d}$$

to a normal probability scale. For exact work the bias indicated in the first equation of (8.72) may be allowed for.

* This quantity is not quite the same as the z used for the ratio of two variances and so we use a prime here to distinguish between them. (See § 9.18.)

Example 5. In a class of 20 students the correlation coefficient between the scores in two 
different tests is r = 0.65. Is it likely that the true coefficient of correlation is as high as 0.5? 

Here $z' = \tanh^{-1} 0.65 = 0.7753$, 

          $\zeta = \tfrac{1}{2}\log_e 3 = 0.5493$ 

and $(N - 3)^{-1/2} = 0.2425$. Hence $(z' - \zeta)(N - 3)^{1/2} = 0.932$, corresponding to a 
probability of 0.176 of obtaining a value of r as high as 0.65 when the true $\rho$ = 0.50. To allow 
for the bias, we may put $\zeta + \rho/2n$ = 0.5493 + 0.5/38 instead of $\zeta$. This reduces the 
standard variate to 0.877, giving P = 0.190. In either case, the true coefficient of correlation 
may well be below 0.5. 
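
The arithmetic of test (a) can be sketched in a few lines of Python. This is only an illustration under the stated normal approximation; the function names `normal_upper_tail` and `fisher_z_test` are hypothetical, and n is taken as N - 1 in the bias term.

```python
import math

def normal_upper_tail(x):
    """P(Z > x) for a standard normal variate Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def fisher_z_test(r, rho0, N, correct_bias=False):
    """Normal deviate and upper-tail probability for testing rho = rho0."""
    z = math.atanh(r)              # z' of (8.70)
    zeta = math.atanh(rho0)
    if correct_bias:
        # Assumption: bias term rho/(2n) with n = N - 1, as in Example 5.
        zeta += rho0 / (2.0 * (N - 1))
    t = (z - zeta) * math.sqrt(N - 3)
    return t, normal_upper_tail(t)

# Data of Example 5: N = 20, r = 0.65, hypothetical rho = 0.50.
t_plain, p_plain = fisher_z_test(0.65, 0.50, 20)
t_corr, p_corr = fisher_z_test(0.65, 0.50, 20, correct_bias=True)
print(round(t_plain, 3), round(p_plain, 3))
print(round(t_corr, 3), round(p_corr, 3))
```

To within rounding, the two calls should agree with the deviates and probabilities quoted in Example 5.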

We may establish approximate confidence limits for $\rho$ by means of the 
z'-transformation. For $\zeta$, 95% confidence limits will be given by 
$z' \pm 1.96(N - 3)^{-1/2}$, and the corresponding values of $\rho$ may then be 
calculated. Thus, for the example above, the confidence limits for $\zeta$ are 
0.7753 ± 0.4753 = 0.3000 and 1.2506. These correspond to $\rho$ = 0.29 and 
$\rho$ = 0.85. An approximate correction for the bias may be made by 
subtracting r/2n = 0.65/38 from z'. This changes the confidence limits for $\zeta$ to 
0.2829 and 1.2335, corresponding to $\rho$ = 0.28 and $\rho$ = 0.84. The actual 
limits as given by David's chart are 0.28 and 0.83. It is evident that the 
normal approximation is satisfactory even in this case. (Correlation coefficients 
derived from samples as small as 20 are generally of little practical 
value.) 
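
A minimal sketch of these confidence limits follows. The function name `rho_confidence_limits` and its parameters are illustrative assumptions, not notation from the text; the bias correction subtracts r/2n with n = N - 1.

```python
import math

def rho_confidence_limits(r, N, z_crit=1.96, correct_bias=False):
    """Approximate confidence limits for rho via the z'-transformation."""
    z = math.atanh(r)
    if correct_bias:
        z -= r / (2.0 * (N - 1))   # subtract r/2n, n = N - 1
    half_width = z_crit / math.sqrt(N - 3)
    # Transform the limits for zeta back to the scale of rho.
    return math.tanh(z - half_width), math.tanh(z + half_width)

lo, hi = rho_confidence_limits(0.65, 20)
lo_c, hi_c = rho_confidence_limits(0.65, 20, correct_bias=True)
print(round(lo, 2), round(hi, 2))
print(round(lo_c, 2), round(hi_c, 2))
```

For r = 0.65 and N = 20 the limits should come out close to those given above, with and without the bias correction.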

Fisher has given a table for converting r to z', but a 4-place table of tanh x 
is at least as convenient to use. 

The z'-transformation is convenient in taking the average of correlations 
from a number of samples, supposedly from the same population. Thus, 
if $r_1, r_2, \cdots, r_k$ are the sample coefficients obtained from samples of sizes 
$N_1, N_2, \cdots, N_k$, the weighted mean of the $z'_i$ will be 

          $\bar{z}' = \dfrac{\sum_i (N_i - 3)\, z'_i}{\sum_i (N_i - 3)}$ 

each $z'_i$ being weighted inversely as its variance. The mean $\bar{r}$ is then given 
by $\bar{r} = \tanh \bar{z}'$. 
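
The averaging rule may be sketched as follows. The function name `pooled_correlation` is hypothetical, and the sample values in the usage lines are invented purely for illustration.

```python
import math

def pooled_correlation(rs, Ns):
    """Weighted mean of z' values, weights N_i - 3, transformed back to r."""
    weights = [N - 3 for N in Ns]
    z_bar = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / sum(weights)
    return math.tanh(z_bar)

# Hypothetical data: two samples supposedly from the same population.
r_bar = pooled_correlation([0.60, 0.40], [28, 23])
print(round(r_bar, 3))
```

The pooled value necessarily lies between the smallest and largest of the sample coefficients, nearer to those from the larger samples.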

E. J. G. Pitman has given a distribution-free test of the significance of r 
when $\rho$ = 0, that is, a test which is independent of any assumption about the 
distribution (normal or not) of the parent population. The test consists in 
comparing r for the sample, with r for all other samples consisting of the same 
set of X and Y values but paired in all possible ways, all such pairings being 
considered equally likely. With small numbers an exact test is possible, but 
with moderately large N and for a parent distribution with not too great $\gamma_1$ 
and $\gamma_2$ the distribution of r is approximately that of (8.54), so that $r^2$ is a 
$\beta[\tfrac{1}{2}, (N - 2)/2]$ variate. A significant value of $r^2$ for any assigned level of 
significance can, therefore, be calculated from the tables of the Incomplete 
Beta Function. 
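
For very small N the exact test can be carried out by brute force. The sketch below enumerates every pairing and reports the fraction whose |r| is at least as large as that observed; the two-sided criterion, the function names, and the data are assumptions made for illustration, not details taken from Pitman's paper.

```python
import itertools
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys):
    """Fraction of all pairings with |r| >= observed |r| (two-sided, assumed)."""
    r_obs = abs(pearson_r(xs, ys))
    count = total = 0
    for perm in itertools.permutations(ys):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(xs, perm)) >= r_obs - 1e-12:
            count += 1
    return count / total

# Hypothetical perfectly correlated sample of N = 5.
p_exact = permutation_p_value([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(p_exact)
```

The enumeration grows as N!, so it is practical only for small samples; for larger N the beta approximation described above is the natural substitute.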

8.16 Rank Correlation. It sometimes happens that, while two attributes 
X and Y cannot be accurately measured, individuals possessing these attributes 
can be ranked in some definite order. The product-moment coefficient, 
calculated by replacing the unknown actual values of X and Y by their ranks, 
is known as the coefficient of rank correlation; its use was proposed by Spearman 
in 1904. 

If X and Y now denote the ranks, and if for the present we ignore ties, the 
mean values of X and Y will be $\bar{X} = \bar{Y} = \tfrac{1}{2}(N + 1)$. Putting $x = X - \bar{X}$, 
$y = Y - \bar{Y}$, we have for the rank coefficient 

(8.73)    $r' = \dfrac{\sum xy}{(\sum x^2 \sum y^2)^{1/2}}$ 

summed over the N sample values. Now the sum of the first N integers is 
$\tfrac{1}{2}N(N + 1)$ and the sum of their squares is $N(N + 1)(2N + 1)/6$, so that 

          $\sum x^2 = N(N + 1)(2N + 1)/6 - N(N + 1)^2/4$ 

(8.74)          $= N(N^2 - 1)/12$ 

and $\sum y^2$ has the same value. Also, if $d_i = X_i - Y_i = x_i - y_i$, we have 

          $\sum d_i^2 = \sum x^2 + \sum y^2 - 2\sum xy = N(N^2 - 1)/6 - 2\sum xy$ 

Therefore 

          $r' = \dfrac{N(N^2 - 1)/12 - \tfrac{1}{2}\sum d_i^2}{N(N^2 - 1)/12}$ 

(8.75)       $= 1 - \dfrac{6\sum d_i^2}{N(N^2 - 1)}$ 

which is the usual formula for computing r'. If the X's are arranged in a 
table in their natural order, and the Y's placed alongside, it is a simple matter 
to compute the $d_i$, and hence r'. For samples less than about 40, r' is easier 
to compute than r, and it is principally for such small samples that r' is used. 
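
Formula (8.75) translates directly into code. The sketch below assumes untied ranks; the function name `spearman_from_ranks` is hypothetical, and the ranks used are those of Example 6, which follows.

```python
def spearman_from_ranks(x_ranks, y_ranks):
    """Rank correlation r' = 1 - 6*sum(d^2) / (N(N^2 - 1)), ties not allowed."""
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Ranks of Example 6.
X = list(range(1, 11))
Y = [1, 2, 5, 3, 8, 4, 9, 6, 7, 10]
r_rank = spearman_from_ranks(X, Y)
print(round(r_rank, 2))
```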

Example 6. 

X      1   2   3   4   5   6   7   8   9   10 

Y      1   2   5   3   8   4   9   6   7   10 

d^2    0   0   4   1   9   4   4   4   4    0 

Hence $\sum d^2$ = 30, N = 10, so that r' = 1 - 180/990 = 81/99 = 0.82. 

It was shown by Hotelling and Pabst that r' can be used as a test of the 
existence of correlation in populations of any type, not necessarily normal, 
and they obtained exact tests for small samples. For large samples the 
distribution of r' approximates a normal distribution. 

If we assign the ranks 1, 2, ..., N to the variates $X_1, X_2, \cdots, X_N$, and if the 
Y's are independent of the X's, the actual set of ranks in any sample 
corresponding to $Y_1, Y_2, \cdots, Y_N$ will be any one of the N! equally likely 
permutations of the numbers 1 to N. The probability of any given value of r' is, 
therefore, proportional to the number of permutations giving rise to it. For 
small values of N, only a few values of r' are possible. For N = 3, for 
example, r' is either -1, -1/2, 1/2 or 1, with probabilities 1/6, 1/3, 1/3, 1/6 
respectively, 
as is easily seen by writing down all the permutations of 1, 2, 3 and computing 
r' for each. The two extreme values, r' = ±1, correspond respectively to the 
X and Y being in precisely the same order and in precisely reverse order, 
and both these values have probability 1/N!. 
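
The enumeration for small N is easily mechanized. The following sketch, with the hypothetical function name `rank_corr_distribution`, lists every permutation and tabulates the exact probabilities of r' using rational arithmetic.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def rank_corr_distribution(N):
    """Exact null distribution of r' over the N! equally likely permutations."""
    base = range(1, N + 1)
    counts = Counter()
    for perm in permutations(base):
        d2 = sum((x - y) ** 2 for x, y in zip(base, perm))
        counts[1 - Fraction(6 * d2, N * (N * N - 1))] += 1
    total = sum(counts.values())
    return {value: Fraction(c, total) for value, c in sorted(counts.items())}

dist3 = rank_corr_distribution(3)
for value, prob in dist3.items():
    print(value, prob)
```

For N = 3 the table has four entries, and the two extreme values each carry probability 1/3! = 1/6.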

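From (8.73) and (8.74) we have 

          $r' = \dfrac{12\sum xy}{N(N^2 - 1)}$ 

Since we are assuming independence, 

          $E(r') = 0$ 

and 

          $\mathrm{Var}(r') = E(r'^2) = 144 N^{-2}(N^2 - 1)^{-2} E(\sum xy)^2$ 

Now the $x_i$ take the same values in all samples, a set of N consecutive 
integers centered at zero (if N is odd) or numbers of the form 
$\cdots, -2\tfrac{1}{2}, -1\tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, 1\tfrac{1}{2}, 2\tfrac{1}{2}, \cdots$ (if N is even). In either case $\sum x_i = 0$ and 
$\sum x_i^2 = N(N^2 - 1)/12$. Also, if i is not equal to j, $\sum x_i x_j = (\sum x_i)^2 - \sum x_i^2 = -\sum x_i^2$. 
Since the $y_i$ are the same numbers as the $x_i$, $E(y_i) = 0$, $E(y_i^2) = (N^2 - 1)/12$ 
and $E(y_i y_j) = -(N^2 - 1)/12(N - 1)$, there being N(N - 1) terms of the form 
$y_i y_j$. Also 

          $(\sum xy)^2 = \sum_i x_i^2 y_i^2 + \sum_{i \ne j} x_i x_j y_i y_j$ 

so that 

          $E(\sum xy)^2 = \sum_i x_i^2 E(y_i^2) + \sum_{i \ne j} x_i x_j E(y_i y_j)$ 

                $= \dfrac{N(N^2 - 1)^2}{144} + \dfrac{N(N^2 - 1)^2}{144(N - 1)}$ 

Hence 

(8.76)    $\mathrm{Var}(r') = \dfrac{1}{N} + \dfrac{1}{N(N - 1)} = \dfrac{1}{N - 1}$ 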

This formula was originally obtained by "Student." The calculation of the 
higher moments of even order is long and complicated. The odd moments 
are all zero, from symmetry. It turns out that 

(8.77)    $\gamma_2 = -\dfrac{114}{25N} - \dfrac{6}{5N^2} - \cdots$ 

and so tends to zero for large N, and that in general the 2a-th moment of 
$r'\sqrt{N - 1}$ tends, as N increases, to the value $(2a)!/(a!\,2^a)$, which is the 2a-th 
moment of a standard normal variate. 
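
The value 1/(N - 1) for the variance of r' under independence can be checked by exhaustive enumeration for small N. The sketch below is only a numerical check, not a proof; `exact_rank_corr_variance` is a hypothetical name.

```python
from fractions import Fraction
from itertools import permutations

def exact_rank_corr_variance(N):
    """Variance of r' over all N! permutations, in exact rational arithmetic."""
    base = range(1, N + 1)
    values = []
    for perm in permutations(base):
        d2 = sum((x - y) ** 2 for x, y in zip(base, perm))
        values.append(1 - Fraction(6 * d2, N * (N * N - 1)))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

for N in range(3, 7):
    print(N, exact_rank_corr_variance(N))
```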

8.17 Kendall's Method of Rank Correlation. M. G. Kendall has suggested 
a different method of computing a rank correlation, which has certain 
definite advantages. If, as in Example 6, the $X_i$ are arranged in a row in 
their natural order of increasing rank and the $Y_i$ are placed as they come, 
underneath, we count for each number in the row of Y how many other 
numbers there are lying to the right of, and greater than, that number. If, 
for the nth number, the number counted is $\phi_n$, we compute $K = \sum_{n=1}^{N} \phi_n$ and 
$\tau = 4K/[N(N - 1)] - 1$. Then $\tau$ is a measure of the agreement of ranks in 
X and Y. 

It is easily verified in Example 6 that K = 9 + 8 + 5 + 6 + 2 + 4 + 1 + 
2 + 1 = 38, whence $\tau$ = 0.69. If the ranks of X and Y agree completely, 
$\tau$ = 1, and if the ranks of Y are the exact reverse of those of X, $\tau$ = -1. 

A useful check on the calculation is provided by counting all greater 
numbers to the left of each number. If L is the sum of these counts, 
$K + L = \tfrac{1}{2}N(N - 1)$. 
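
The counting procedure and its check can be sketched as follows for untied ranks. The function name `kendall_counts` is hypothetical; the Y row is that of Example 6.

```python
def kendall_counts(y_ranks):
    """Return (K, L): counts of greater numbers to the right and to the left."""
    n = len(y_ranks)
    K = sum(1 for i in range(n) for j in range(i + 1, n) if y_ranks[j] > y_ranks[i])
    L = sum(1 for i in range(n) for j in range(i) if y_ranks[j] > y_ranks[i])
    return K, L

Y = [1, 2, 5, 3, 8, 4, 9, 6, 7, 10]   # Y row of Example 6, X in natural order
N = len(Y)
K, L = kendall_counts(Y)
tau = 4.0 * K / (N * (N - 1)) - 1.0
print(K, L, round(tau, 2))
```

The double loop takes of the order of N² comparisons, which is negligible for the small samples for which rank methods are ordinarily used.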

The distribution of values of $\tau$ when a given ranking is correlated with each 
of the N! possible permutations of the ranks may be found fairly readily. 
Like that of r', this distribution tends to normality as $N \to \infty$, but it does so 
much more rapidly. For N > 10, the normal approximation is quite close. 
The variance of $\tau$ is $2(2N + 5)/9N(N - 1)$, the skewness is zero, and the 
kurtosis ($\gamma_2$) is $-54/25N + O(1/N^2)$. 

8.18 Ties in Rank Correlation. The customary procedure when there are 
ties in the ranking is to divide the corresponding rank numbers equally among 
the variates concerned, using fractions where necessary. Thus, if a group of 
14 students were given letter grades as indicated below, and these grades were 
interpreted as ranks, we should have 

Grade   A+   A     A     A-   B+   B   B   B   B-   C+   C      C      C-   D 

Rank    1    2.5   2.5   4    5    7   7   7   9    10   11.5   11.5   13   14 

When there are ties in one or both of two rankings, Spearman's rank order 
formula (8.75) does not give the same result as the product-moment correlation 
of ranks, since the sum of squares of the tied ranks about their mean is no 
longer equal to $N(N^2 - 1)/12$. 

The presence of ties affects also the variance of the rank correlation. For 
example, if one ranking is untied and the other contains sets of $t_1, t_2, \cdots$ tied 
members, the variance of $\tau$ is reduced by the amount 
$2\sum t(t - 1)(2t + 5)/9N^2(N - 1)^2$. Thus, in the example given, if the ranking were correlated 
with another (untied) ranking, we should have $t_1 = 2$, $t_2 = 3$, $t_3 = 2$, so that 
$\sum t(t - 1)(2t + 5)$ would be 102 and the variance would be reduced from 
0.0403 to 0.0396. 
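
Both the sharing of ranks among tied members and the reduction in variance can be sketched briefly. The function names are hypothetical, the input is assumed to be already sorted so that tied items are adjacent, and the variance reduction is taken in the form 2Σt(t - 1)(2t + 5)/9N²(N - 1)², which reproduces the figures quoted for the 14 grades.

```python
from itertools import groupby

def midranks(sorted_labels):
    """Share rank numbers equally within each run of tied (adjacent) labels."""
    ranks, start = [], 1
    for _, group in groupby(sorted_labels):
        t = len(list(group))
        ranks.extend([start + (t - 1) / 2.0] * t)
        start += t
    return ranks

def tau_variance(N, tie_sizes=()):
    """Variance of tau when one ranking is untied and the other has ties."""
    full = 2.0 * (2 * N + 5) / (9.0 * N * (N - 1))
    reduction = (2.0 * sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)
                 / (9.0 * N ** 2 * (N - 1) ** 2))
    return full - reduction

grades = ["A+", "A", "A", "A-", "B+", "B", "B", "B", "B-", "C+", "C", "C", "C-", "D"]
ranks = midranks(grades)
print(ranks)
print(round(tau_variance(14), 4), round(tau_variance(14, (2, 3, 2)), 4))
```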

8.19 Contingency Tables. Very frequently in experimental work we deal 
with some characteristics or attributes that are not susceptible of accurate 
measurement, although it is possible to divide the population into two or 
more categories with reference to these attributes. 

An example was given in § 8.5, where the division into two categories 
produced a 2 X 2 table (Tables 12 and 13). For other attributes, such as 
hair color, it may be desirable to have four or five categories. Such tables 
are called contingency tables. For the special case of a 2 X 2 table, the method 
of tetrachoric correlation described in § 8.5 gives a measure of the association 
between the two attributes, but for other sizes of table some other method of 
measuring the association is necessary, and even for 2 X 2 tables this other 
method is generally preferable. 

Let us suppose that we have two attributes denoted by A and B and that 
our sample of N is divided into s and t classes with respect to these two 
attributes. The resulting frequency table will be of the form shown in Table 14 
where s = 5 and t = 3. 

Table 14 

           B1      B2      B3 

  A1       f11     f12     f13     r1 

  A2       f21     f22     f23     r2 

  A3       f31     f32     f33     r3 

  A4       f41     f42     f43     r4 

  A5       f51     f52     f53     r5 

           c1      c2      c3      N 

The row marginal totals are denoted by $r_1, r_2, \cdots, r_s$ and the column marginal 
totals by $c_1, c_2, \cdots, c_t$. 

If the attributes A and B are completely independent, the proportions in 
the different B-categories will, in the parent population, be the same, irrespective 
of the distribution of the A-categories. That is, if we select a sub-population 
consisting only of those individuals with attribute $A_m$, the proportions 
$p_{m1}, p_{m2}, \cdots, p_{mt}$ in the various B-categories of this sub-population will 
be the same as in the corresponding categories of the whole population. The 
same thing, of course, holds also for the proportions in the various A-categories 
corresponding to a fixed B-category. Hence, if the observed frequencies in 
the sample reflected precisely the corresponding proportions in the parent 
population, we should have, in the case of complete independence, $f_{mn}/r_m = c_n/N$ 
or $f_{mn} = r_m c_n/N$. Deviations from these equalities in the observed 
sample are to be attributed, on our hypothesis of independence, to sampling 
fluctuations. 

It is conventional to consider the probabilities of such deviations relative to 
the set of all possible cell-frequencies with the same marginal totals. That is, 
the $r_m$ and $c_n$ are treated as fixed, and the expected frequency in the mn-th 
cell is then $\phi_{mn} = r_m c_n/N$. The value of $\chi^2$ is 

          $\chi^2 = \sum_{m,n} (f_{mn} - \phi_{mn})^2/\phi_{mn}$ 

summed over all the st cells in the table. Since all the marginal totals are fixed, 
however, the number of degrees of freedom is reduced by s + t - 1 (not s + t, 
because the sum of the row marginal totals must be equal to the sum of the 
column marginal totals). Hence the number of degrees of freedom for $\chi^2$ is 
(s - 1)(t - 1), and the significance of an observed $\chi^2$ can be determined. 

A measure of the degree of association between the two attributes considered 
is provided by Pearson's coefficient of mean square contingency. This is defined 
by 

(8.78)    $C = [\chi^2/(\chi^2 + N)]^{1/2} = [\psi^2/(\psi^2 + 1)]^{1/2}$ 

where $\psi^2 = \chi^2/N$. From the definition of $\chi^2$ we have 

          $\chi^2 = \sum [f_{mn}^2/\phi_{mn} - 2f_{mn} + \phi_{mn}]$ 

                $= \sum f_{mn}^2/\phi_{mn} - N$ 

                $= N[\sum (f_{mn}^2/r_m c_n) - 1]$ 

since $\sum f_{mn} = \sum \phi_{mn} = N$. (This is, in fact, the simplest way of computing 
$\chi^2$.) It follows that $\psi^2 + 1 = \sum (f_{mn}^2/r_m c_n) = S$, say, and that 

(8.79)    $C = [(S - 1)/S]^{1/2}$ 

Even with perfect association between the two attributes, C is not equal to 1, 
although it tends to this value as the number of rows and columns increases. 
For a 2 × 2 table, C cannot exceed $1/\sqrt{2}$ = 0.707. If the parent universe is 
assumed to have an underlying bivariate normal distribution, then C tends 
with finer and finer subdivisions to equality with the Pearson coefficient of 
correlation r. 

Example 7. The following results were obtained in an investigation of the association 
between "left-handedness" (determined by a balancing test) and "left-eyedness" (as measured 
by general astigmatism). 

Table 15 

                  Left-eyed   Ambiocular   Right-eyed   Totals 

Left-handed           34          62           28         124 

Ambidextrous          27          28           20          75 

Right-handed          57         105           52         214 

Totals               118         195          100         413 

Here 

          $S = \dfrac{34^2}{124 \times 118} + \cdots + \dfrac{52^2}{100 \times 214} = 1.009734$ 

          $\chi^2 = 413(S - 1) = 4.020$, and $C^2 = (S - 1)/S = 0.0096$, so that C = 0.098. 

The value of $\chi^2$ is certainly not significant with 4 degrees of freedom, and the coefficient C is 
very small. There is very little evidence against the assumption of complete independence. 
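
The computation of S, χ² and C by (8.79) can be sketched for any s × t table. The function name `contingency_coefficient` is hypothetical; the frequencies are those of Table 15.

```python
import math

def contingency_coefficient(table):
    """Return (S, chi2, C) for a table of observed frequencies (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    # S is the sum of f_mn^2 / (r_m * c_n) over all cells.
    S = sum(f * f / (r * c)
            for row, r in zip(table, row_totals)
            for f, c in zip(row, col_totals))
    chi2 = N * (S - 1.0)
    C = math.sqrt((S - 1.0) / S)
    return S, chi2, C

table15 = [[34, 62, 28], [27, 28, 20], [57, 105, 52]]
S, chi2, C = contingency_coefficient(table15)
print(round(S, 6), round(chi2, 3), round(C, 3))
```

For these data C itself is about 0.098, the value 0.0096 being that of C².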


8.20 The 2 × n and 2 × 2 Contingency Tables. For the special case of a 
2 × n or n × 2 table, the calculation of $\chi^2$ may be simplified. If the table is 
arranged as in Table 16, where A and B are the totals of the $a_i$ and $b_i$ 
respectively, 

Table 16 

  a1     a2     ...    an     A 

  b1     b2     ...    bn     B 

  c1     c2     ...    cn     N 

the proportion of a's in the ith column is $p_i = a_i/c_i$, and the proportion 
of b's is $q_i = 1 - p_i$. The weighted mean values of $p_i$ and $q_i$ are $\bar{p} = A/N$ and 
$\bar{q} = B/N$. Then 

(8.80)    $\chi^2 = \dfrac{\sum a_i p_i - A\bar{p}}{\bar{p}\,\bar{q}}$ 

which is the formula attributed to Brandt and Snedecor. Either the a's or 
the b's can, of course, be used in (8.80), the smaller frequencies being generally 
chosen. 
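
As a numerical check, the sketch below compares the Brandt and Snedecor shortcut, taken here in its usual form χ² = (Σ aᵢpᵢ - A p̄)/(p̄ q̄), with the general sum of (observed - expected)²/expected. The function names and the 2 × 3 frequencies are invented for illustration.

```python
def chi2_general(a, b):
    """Chi-square for a 2 x n table from observed and expected frequencies."""
    c = [ai + bi for ai, bi in zip(a, b)]
    A, B = sum(a), sum(b)
    N = A + B
    total = 0.0
    for row, r in ((a, A), (b, B)):
        for f, cj in zip(row, c):
            expected = r * cj / N
            total += (f - expected) ** 2 / expected
    return total

def chi2_brandt_snedecor(a, b):
    """Shortcut form: (sum a_i p_i - A p_bar) / (p_bar q_bar)."""
    c = [ai + bi for ai, bi in zip(a, b)]
    A, N = sum(a), sum(c)
    p_bar = A / N
    q_bar = 1.0 - p_bar
    return (sum(ai * ai / ci for ai, ci in zip(a, c)) - A * p_bar) / (p_bar * q_bar)

a = [12, 20, 8]    # hypothetical first row
b = [18, 10, 22]   # hypothetical second row
print(chi2_general(a, b), chi2_brandt_snedecor(a, b))
```

The two functions should agree to within rounding error for any table with no empty column.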

For a 2 × 2 table, arranged as in Table 12 (§ 8.5), the contribution to $\chi^2$ 
arising from the cell with observed frequency a is $(a - \alpha)^2/\alpha$, where 
$\alpha = (a + b)(a + c)/N$. Now $a - \alpha = (ad - bc)/N$, and because of the assumed 
constancy of the marginal totals, this difference between observed and expected 
frequencies is numerically the same for each cell, differing only in sign. 
Hence $\chi^2 = (ad - bc)^2 (1/\alpha + 1/\beta + 1/\gamma + 1/\delta)/N^2$. The second bracket 
on the right is easily seen to reduce to $N^3/(a + b)(a + c)(b + d)(c + d)$, on 
substituting for $\alpha, \beta, \gamma, \delta$, and remembering that a + b + c + d = N. ($\beta$, 
$\gamma$, $\delta$ are expressions like $\alpha$ for the other three cells.) We have then 

(8.81)    $\chi^2 = \dfrac{N(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)} = \dfrac{N(ad - bc)^2}{r_1 r_2 c_1 c_2}$ 

where $r_1$, $r_2$ are the two row totals and $c_1$, $c_2$ the two column totals. 

The distribution in a contingency table is necessarily discontinuous, whereas 
the $\chi^2$ distribution is continuous. The approximation to $\chi^2$ is something like 
the approximation of the discontinuous binomial distribution to a normal one, 
where, as was noted in Chapter II, the calculated frequency between two 
values of x, say a and b inclusive, is given by the area under the corresponding 
normal curve, not between a and b but between $a - \tfrac{1}{2}$ and $b + \tfrac{1}{2}$. Similarly, 
as was suggested by Yates, the approximation to $\chi^2$ is improved by replacing 
one cell frequency, say d, by $d + \tfrac{1}{2}$ or $d - \tfrac{1}{2}$ according as ad < bc or ad > bc, 
and adjusting the others to keep the marginal totals unaltered. The effect is to 
replace ad - bc in (8.81) by |ad - bc| - N/2. This is known as Yates's correction 
for continuity. It undoubtedly improves the estimate of significance for a 2 × 2 
table, and should always be applied unless the cell frequencies are all quite 
large, say 500 or more. In using the $\chi^2$ test for tables with small values of N, 
it should be borne in mind that the quantity on the right-hand side of (8.81) 
actually has the $\chi^2$ distribution only in the limit as N tends to infinity. Even 
with the Yates correction, it cannot be assumed that for small values of N 
the probabilities calculated from $\chi^2$ will be accurate. 

8.21 The Exact Distributions for 2 × 2 Tables. There has been some discussion 
in the literature concerning the proper method of dealing with 2 × 2 
tables. The situation has been clarified in two papers, one by G. A. Barnard 
and one by E. S. Pearson in Biometrika, 1947, pointing out that there are 
really three distinct problems, each of which gives rise to a 2 × 2 table, although 
the underlying probability conceptions are different. 

Problem I. This is the one usually considered, in which both sets of marginal 
totals are fixed. It is called by Barnard the 2 × 2 independence trial. 
The mathematical model corresponding to Table 14, with a, b, c, d written 
for $f_{11}, f_{12}, f_{21}, f_{22}$, is that of N balls in an urn, of which $r_1$ are marked $A_1$ and $r_2$ 
are marked $A_2$. The balls are withdrawn in random order and put into a row 
of N boxes, one ball to a box, of which $c_1$ are marked $B_1$ and $c_2$ are marked 
$B_2$. The number of balls marked $A_1$ in the boxes marked $B_1$ will be $f_{11}$ or a, and 
similarly for the three other cells in the table. 

The probability of a $A_1$'s and c $A_2$'s in the $c_1$ boxes marked $B_1$ is given by 
the hypergeometric law [see (2.66)], since this is a problem of sampling without 
replacements. It is 

          $\dbinom{r_1}{a}\dbinom{r_2}{c} \Big/ \dbinom{N}{c_1} = \dfrac{r_1!\, r_2!\, c_1!\, c_2!}{a!\, b!\, c!\, d!\, N!}$ 

When the distribution in the boxes marked $B_1$ is fixed, that in the boxes 
marked $B_2$ is also determined, so that the probability of the observed 2 × 2 
table, in which we suppose that d is the smallest frequency, is 

(8.82)    $p' = (r_1!\, r_2!\, c_1!\, c_2!)/(a!\, b!\, c!\, d!\, N!)$ 

R. A. Fisher derived from this expression his exact test. This consists 
in computing the total probability of the observed distribution and of all 
less likely ones in the same direction, that is, for all values of d from 0 up to the 
observed value (if d < $\delta$, the expected value of d). This probability 
$P = p'_0 + p'_1 + p'_2 + \cdots + p'_d$ 
corresponds to one tail of the distribution, and thus is comparable with half 
the probability calculated from $\chi^2$, since the latter corresponds to both tails 
of the distribution. If d > $\delta$ the tail is from d up to $c_2$ inclusive. 

Example 8. The data are intended to exhibit a relationship between inoculation and 
immunity from attack among a population exposed to a certain disease. 

                  Not inoculated   Inoculated 

Not attacked             3              5            8 

Attacked                10              2           12 

                        13              7           20 

For this table, $\chi^2 = (44^2 \times 20)/(8 \times 12 \times 13 \times 7) = 4.43$, corresponding to P = 0.035. 
With the Yates correction, $\chi^2$ is reduced to 2.65, corresponding to P = 0.103, so that the 
correction changes a significant probability to a nonsignificant one, on the customary level. 
The probability of the observed distribution is (8! 12! 13! 7!)/(3! 5! 10! 2! 20!) = 0.0477, and 
the probabilities of the two more extreme ones corresponding to d = 1 and d = 0 are 0.0043 
and 0.0001, so that Fisher's P = 0.052, or, for both tails, 0.104. It is obvious that Yates's 
correction makes a great improvement. 
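
The three calculations of Example 8 can be sketched together. The function names are hypothetical; the one-tail sum assumes, as in the example, that d lies below its expected value, and `math.comb` requires Python 3.8 or later.

```python
import math

def chi2_2x2(a, b, c, d, yates=False):
    """Chi-square of (8.81), optionally with Yates's correction."""
    N = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - N / 2.0, 0.0)
    return N * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi2_tail_1df(x):
    """P(chi-square with 1 d.f. > x), equal to the two-tailed normal area."""
    return math.erfc(math.sqrt(x / 2.0))

def table_probability(a, b, c, d):
    """Hypergeometric probability p' of (8.82) with all margins fixed."""
    N = a + b + c + d
    return math.comb(a + b, a) * math.comb(c + d, c) / math.comb(N, a + c)

def fisher_one_tail(a, b, c, d):
    """Sum p' over d, d-1, ..., 0 (assumes d is below expectation)."""
    total = 0.0
    while d >= 0 and a >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a - 1, b + 1, c + 1, d - 1   # margins unchanged
    return total

x2 = chi2_2x2(3, 5, 10, 2)
x2_yates = chi2_2x2(3, 5, 10, 2, yates=True)
p_obs = table_probability(3, 5, 10, 2)
P_fisher = fisher_one_tail(3, 5, 10, 2)
print(round(x2, 2), round(chi2_tail_1df(x2), 3))
print(round(x2_yates, 2), round(chi2_tail_1df(x2_yates), 3))
print(round(p_obs, 4), round(P_fisher, 3))
```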

The chief objection to Fisher's exact test is the large amount of computation 
involved when the cell frequencies are at all large. E. S. Pearson has 
pointed out that a normal approximation, with a continuity correction equivalent 
to Yates's correction, gives, as a rule, a surprisingly good result. The 
method consists in calculating, for the cell frequency d, the expected value 
$\delta = r_2 c_2/N$, and the variance $\sigma_d^2 = r_1 r_2 c_1 c_2/N^2(N - 1)$, as given by (2.77) and 
(2.80). Then the quantity 

          $t = (|d - \delta| - \tfrac{1}{2})/\sigma_d$ 

is treated as a normal variate and the corresponding tail probability P is found. 

For Example 8, $\delta$ = 84/20, $\sigma_d^2$ = 1.149, t = 1.7/1.072 = 1.586 so that 
P = 0.056 which agrees fairly well with the exact value.* 
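
A sketch of this normal approximation follows; the function name `pearson_normal_approx` is hypothetical and the cell layout (a, b in the first row, c, d in the second) is assumed to be that of Example 8.

```python
import math

def pearson_normal_approx(a, b, c, d):
    """Continuity-corrected normal deviate for cell d and its one-tail probability."""
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    N = r1 + r2
    delta = r2 * c2 / N                                   # expected value of d
    var_d = r1 * r2 * c1 * c2 / (N ** 2 * (N - 1))        # hypergeometric variance
    t = (abs(d - delta) - 0.5) / math.sqrt(var_d)
    return t, 0.5 * math.erfc(t / math.sqrt(2.0))

t, P = pearson_normal_approx(3, 5, 10, 2)
print(round(t, 3), round(P, 3))
```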

* Mainland, see reference (17), has given extensive tables based on Fisher's method, for 
estimating the significance of observed 2 × 2 distributions without going through the labor 
of computation. 

Problem II. This is the test of whether the proportion of individuals having 
the characteristic $A_1$ is the same in two different populations, distinguished 
by the characteristics $B_1$ and $B_2$, a random sample being drawn from each. 
Barnard calls this the 2 × 2 comparative trial. The mathematical model is of 
two urns each containing a very large number of balls, these balls being all 
labeled either $A_1$ or $A_2$. In the first urn ($B_1$) the proportion of $A_1$'s is $p_1$, 
while in the second urn ($B_2$) it is $p_2$. A random sample of $c_1$ is drawn from urn 
I and contains a $A_1$'s and c $A_2$'s. A random sample of $c_2$ from urn II contains 
b $A_1$'s and d $A_2$'s. The hypothesis $H_0$ to be tested is that $p_1 = p_2 = p$. 

If this hypothesis is true, the probability of the observed result is 

          $\dfrac{c_1!\, c_2!}{a!\, b!\, c!\, d!}\; p^{r_1}(1 - p)^{r_2}$ 

which is equal to Fisher's expression p' (8.82) multiplied by a factor 

          $\dfrac{N!}{r_1!\, r_2!}\; p^{r_1}(1 - p)^{r_2}$ 

Here, of course, the conditions are different, because we are no longer insisting 
on constant row totals. In various repetitions of the experiment the row 
totals can vary, although the column totals are still fixed. 

In this problem the basic probability set, with reference to which probabilities 
are calculated, is two-dimensional, instead of one-dimensional. The 
set of possible values of a and b ($0 \le a \le c_1$, $0 \le b \le c_2$) is a lattice of points 
as represented in Figure 27. It is fairly clear that for points lying 
near the diagonal OD there will be little reason to reject the hypothesis 
that $p_1 = p_2$, whereas the argument for doing so gains force as the 
point (a, b) moves toward one of the corners A or B. We should like to 
be able to draw lines on this diagram, cutting off at each end of 
every diagonal, $r_1$ = constant, a group of dots (shown in black) for 
which the total probability is equal to $\alpha$. If this were done, and 
if the null hypothesis were rejected whenever (a, b) fell in the region of the 
black dots, the chance of committing an error of the first kind (that is, 
rejecting $H_0$ when it is true) would be $\alpha$ 
and so would be independent of the unknown parameter p. 

Fig. 27 

Although this is 
not practicable, Barnard has given a systematic method of classifying the 
points in the lattice, and Pearson has suggested that, when all the marginal 
totals are fairly large, the quantity $u = (a - \bar{a})/\sigma_a$ may, on the null hypothesis, 
be treated as a standard normal variate. Here $\bar{a}$ and $\sigma_a$ are given by 
the hypergeometric formulae 

          $\bar{a} = r_1 c_1/N$ 

          $\sigma_a^2 = r_1 r_2 c_1 c_2/N^2(N - 1)$ 

If, in Example 8, we think of the 13 not-inoculated and the 7 inoculated 
persons as independent random samples from two exposed populations (the 
not-inoculated and the inoculated), in which the numbers of persons actually 
attacked may vary from 0 to 13, or from 0 to 7, respectively, we have a problem 
of type II. For the observed a, u = -2.05, corresponding to P = 0.020, 
which suggests significance. 

Problem III. This is the case of a double dichotomy. It is assumed that 
in the parent population there is a probability $p_a$ that an individual selected 
at random will have the characteristic $A_1$ and an independent probability $p_b$ 
that a random individual will have the characteristic $B_1$. The probabilities 
of the four possible combinations ($A_1B_1$, $A_1B_2$, $A_2B_1$, $A_2B_2$) are, therefore, 
$p_{11} = p_a p_b$, $p_{12} = p_a(1 - p_b)$, $p_{21} = (1 - p_a)p_b$ and $p_{22} = (1 - p_a)(1 - p_b)$ 
respectively. The probability of the observed sample is given by the multinomial 
law and is 

          $\dfrac{N!}{a!\,b!\,c!\,d!}\; p_{11}^a\, p_{12}^b\, p_{21}^c\, p_{22}^d = p' \cdot p_r \cdot p_c$ 

where p' is Fisher's expression (8.82), and $p_r$, $p_c$ are the binomial probabilities 
for the row totals and the column totals respectively. The basic probability 
set is now three-dimensional, since the column totals as well as the row totals 
may vary, and little has been done on the exact treatment. Unless some of 
the marginal totals are very small, it appears, however, that the usual $\chi^2$ 
approximation is reasonably adequate. 
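
The factorization of the multinomial probability into Fisher's p' and the two binomial probabilities of the margins can be checked numerically. In the sketch below the function names and the values chosen for the two population probabilities are purely illustrative.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def double_dichotomy_check(a, b, c, d, pa, pb):
    """Return (multinomial probability, p' * p_r * p_c) for a 2 x 2 table."""
    N = a + b + c + d
    r1, c1 = a + b, a + c
    coefficient = math.factorial(N) // (math.factorial(a) * math.factorial(b)
                                        * math.factorial(c) * math.factorial(d))
    multinomial = (coefficient
                   * (pa * pb) ** a * (pa * (1 - pb)) ** b
                   * ((1 - pa) * pb) ** c * ((1 - pa) * (1 - pb)) ** d)
    p_fisher = math.comb(r1, a) * math.comb(N - r1, c) / math.comb(N, c1)
    product = p_fisher * binom_pmf(r1, N, pa) * binom_pmf(c1, N, pb)
    return multinomial, product

# Table of Example 8 with hypothetical population probabilities.
lhs, rhs = double_dichotomy_check(3, 5, 10, 2, pa=0.4, pb=0.65)
print(lhs, rhs)
```

The two numbers should agree to within floating-point error whatever values of the probabilities are supplied.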

8.22 The Combination of Probabilities from 2 × 2 Tables. Sometimes, in 
a group of related experiments, it is desired to estimate the over-all significance 
of the results. R. A. Fisher has given a method of combining probabilities 
which is useful in such cases. If a continuous variable x has the frequency 
function f(x), and if P is the probability that x does not exceed the value $x_1$, 
then 

          $P = \int_{-\infty}^{x_1} f(x)\,dx$ 

and has a range from 0 to 1. Treating P as a random variable depending on 
$x_1$ with frequency function g(P), we have 

          $g(P)\,dP = f(x_1)\,dx_1$ 

But $dP = f(x_1)\,dx_1$, so that g(P) = 1. This means that P has a rectangular 
distribution on the range 0 to 1. 

If $u = -2\log_e P$, we have 

          $\dfrac{du}{dP} = -\dfrac{2}{P}$ 

so that, if the frequency function of u is h(u), 

          $h(u)\,|du| = dP$ 

or 

          $h(u) = \dfrac{P}{2} = \tfrac{1}{2}e^{-u/2}$ 

The distribution of u is, therefore, the same as that of $\chi^2$ with 2 d.f. 

If now we combine k independent probabilities, the combined probability 
is the product of the k separate probabilities, so that 

          $w = -2\log_e P = -2\log_e (P_1 P_2 \cdots P_k)$ 

            $= -2\sum_i \log_e P_i = \sum_i u_i$ 

and so has the $\chi^2$ distribution with 2k degrees of freedom. 

Example 9. The one-tailed probabilities on the basis of the null hypothesis obtained 
from three related 2 × 2 tables are 0.0178, 0.0214, and 0.0052. What is the probability of 
the combined result? 

Here $\sum \log_{10} P_i = -5.7032$, so that w = 2 × 2.3026 × 5.7032 = 26.26. 

The corresponding P for 6 degrees of freedom is 0.0002, which also is a one-tailed 
probability. It may be doubled to give the customary two-tailed significance probability, but in 
any case is obviously highly significant. 
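
The combination can be sketched without tables, since for an even number of degrees of freedom the upper tail of χ² has the closed form exp(-x/2) Σ (x/2)^j / j!, summed for j below half the degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of chi-square for even df (closed form)."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

def combine_probabilities(ps):
    """Fisher's w = -2 * sum(log P_i) and its chi-square tail with 2k d.f."""
    w = -2.0 * sum(math.log(p) for p in ps)
    return w, chi2_upper_tail_even_df(w, 2 * len(ps))

w, P = combine_probabilities([0.0178, 0.0214, 0.0052])
print(round(w, 2), round(P, 4))
```

For the three probabilities of Example 9 the sketch gives w close to 26.26 and a combined one-tailed probability close to 0.0002.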

In a 2 X 2 table, the distribution is actually discrete instead of continuous, 
so that the distribution of P is not really rectangular, but for moderately 
large frequencies, the approximation is satisfactory. 

Problems 

1. Establish the truth or falsity of the following proposition: A necessary and sufficient 
condition that two variables be normally correlated is that their regression systems be 
linear. 

2. Prove that the regression systems of two normally correlated variables are linear 
and homoscedastic. 

3. For (8.13) prove the following: 

(a) the mean value of $y_x$ taken over all values of x is zero ($y_x$ as defined by (8.1)), 

(b) the variance of $y_x$ is equal to $\rho^2\sigma_y^2$, 

(c) the correlation coefficient between $y_x$ and y is equal to $\rho$. 

Hints. (a) $\bar{y}_x = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y_x\, f(x, y)\,dy\,dx$ 

(b) $\sigma_{y_x}^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y_x^2\, f(x, y)\,dy\,dx$ 

(c) Evaluate $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\, y_x\, f(x, y)\,dy\,dx$ 

4. Prove the statement in § 8.7 that a is normally distributed about $\alpha$ with variance 
$\sigma_{ey}^2 (s_x^2 + \bar{x}^2)/N s_x^2$. 

5. Show that the 100(1 - $\alpha$)% confidence limits for $\omega$ obtained from (8.59) 
are given by 

          $\omega^2 - 2Aw\omega + w^2 = 0$ 

where 

          $A = 1 + \dfrac{2(1 - r^2)\,t_\alpha^2}{N - 2}$ 

and therefore are equal to 

          $Aw \pm w(A^2 - 1)^{1/2}$ 

6. Obtain the maximum likelihood estimate of $\rho$ from the sampling distribution of r, 
(8.68), by putting $\partial \log f(r)/\partial\rho = 0$ and solving the resulting quadratic equation for $\rho$ as far 
as terms of order 1/n. (Take only the first term of the series in (8.68).) 

7. Prove that $\chi^2$ for the table 

          a + 1/2     b - 1/2 

          c - 1/2     d + 1/2 

is given by $\chi^2 = \dfrac{N(|ad - bc| - \tfrac{1}{2}N)^2}{(a + b)(c + d)(a + c)(b + d)}$ if ad < bc. 

8. One random sample of 28 from a certain bivariate population gave r = 0.60; another 
independent random sample of 23 gave r = 0.40. Is the difference significant? (Use a 
two-tailed test in estimating the probability, since we are interested here in the numerical value 
of the difference, and not in the sign.) 

9. A correlation coefficient of 0.561 is said to be highly significant. Assuming that this 
refers to the 1% level of significance, what is the least number of pairs of observations that 
must have been made in order to warrant the statement? Ans. 20. 

10. For the data of Example 2, § 8.8, plot the observed regression line and the band on 
either side of it contained between the upper and lower confidence limits for y. Note that 
this band is not of uniform width, although the width does not vary a great deal. For large 
values of N the edges of the band are practically parallel to the regression line. 

The following three problems are from Fisher's book. 

11. For the twenty years 1885-1904, the mean wheat yield of Eastern England was found 
to be correlated with the autumn rainfall; the correlation was found to be -.629. Is this 
value significant? 

12. In a sample of N = 25 pairs of parent and child the correlation in a certain character 
was found to be .60. Is this value consistent with the view that the true correlation in that 
character was .46? 

13. Of two samples the first, of 20 pairs, gives a correlation of .6, the second, of 25 pairs, 
gives a correlation .8. Are these values significantly different? 

14. The following table gives average annual wheat yield (bushels/acre to the nearest 
bushel) and effective rainfall (to the nearest inch) for the Calgary district of Alberta, 1910- 
1937. Effective rainfall is defined as the rainfall during September and October of the 
previous year plus rainfall in May, June, July, and August of the specified year. 

Year    Wheat Yield    Rainfall        Year    Wheat Yield    Rainfall 

1910        13             5           1924         9            11 
1911        21            17           1925        17            14 
1912        18            19           1926        18            15 
1913        20            14           1927        28            30 
1914        15            13           1928        26            17 
1915        31            17           1929        11            10 
1916        27            12           1930        17            11 
1917        19             9           1931        13             9 
1918         8             7           1932        19            17 
1919        12             7           1933        10            10 
1920        21            15           1934        12            12 
1921        11             8           1935        13            14 
1922        11             9           1936         6             9 
1923        28            21           1937         8            13 

Calculate the regression coefficient of wheat yield on rainfall. Obtain 95% confidence 
limits for this coefficient. Calculate the coefficient of correlation, and find 95% confidence 
limits for it. If wheat yield y is estimated from rainfall x, obtain an expression for the 95% 
confidence limits of y as a function of x. Ans. $\beta$ = 0.95 ± 0.38, r = 0.71, $\rho$ = 0.45 to 0.85, 
$y = 4.16 + 0.946x \pm 10.1[1.036 + (x - \bar{x})^2/717]^{1/2}$. 

16. From the following 2X2 table can one conclude that the medical condition known 
as synostosis of the sternum is associated with tuberculosis? 

Withcnd Synostosis With Synostosis 

Without T.B. 66 7 

With T.B. 7 4 

Hint* The smallest expected frequency is here so small that the x* test is unreliable. 
Note that the frequency 4 is above expectation, so that the tail of the distribution of possible 
frequencies corresponds to values even greater than 4. There is a marked asymmetry in the 
distribution, the only possible values of d which are less than expectation being 1 and 0. 
Fisher and Yates (Statistical Tables^ VIII) give a table for use whenever the smallest expec- 
tation in the 2X2 table is less than 10. The 2,5 and 0.5 per cent points of the corrected 
Xe (the square root of x^) a-re calculated for the two tails separately. In the above example 
we would use the longer tail. 
ANALYSIS OF VARIANCE AND COVARIANCE


9.1 Analysis of Variance. One-way Classification. The test of significance 
between two independent estimates of a population variance may be applied 
in a great many types of experimental design. A general technique, known 
as the analysis of variance, was developed by R. A. Fisher for separating the 
experimentally observed variance into portions traceable to specific sources. 
The kind of procedure one attempts to follow in such an analysis can be illus- 
trated by the following scheme. 

Imagine a set of b families of which the kth family contains $N_k$ individuals.
These families are subjected to different treatments $T_1, T_2, \ldots, T_b$, and a variable
x is measured for each individual. The individuals may, for example, be
plots of land, and the T's may be fertilizer treatments. The x's would then
be yields of some specified crop grown on all the plots. Or the families may be
batches of steel ingots containing slightly different amounts of some ingredient,
and the x's the result of tests on the breaking strength of the metal. We have,
then, $N = \sum_k N_k$ independent values of x, classified into b columns as in
Table 17.

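Table 17

  $T_1$           $T_2$           ...    $T_b$
  $x_{11}$        $x_{12}$        ...    $x_{1b}$
  $x_{21}$        $x_{22}$        ...    $x_{2b}$
  ...             ...                    ...
  $x_{N_1,1}$     $x_{N_2,2}$     ...    $x_{N_b,b}$

Let $\bar{x}_{.k}$ denote the mean of the kth family and $\bar{x}$ the overall mean. Then

(9.1)    $N_k\bar{x}_{.k} = \sum_j x_{jk}, \qquad k = 1, 2, \ldots, b$

and

(9.2)    $N\bar{x} = \sum_{j,k} x_{jk}$

Now the variance of the whole set of x's in Table 17 is Q/N, where

(9.3)    $Q = \sum_{j,k}(x_{jk} - \bar{x})^2 = \sum_{j,k} x_{jk}^2 - N\bar{x}^2$

and this can be split up into two sums of squares

(9.4)    $Q = q_1 + q_2$

where

(9.5)    $q_1 = \sum_k N_k(\bar{x}_{.k} - \bar{x})^2 = \sum_k N_k\bar{x}_{.k}^2 - N\bar{x}^2$

         $q_2 = \sum_{j,k}(x_{jk} - \bar{x}_{.k})^2 = \sum_{j,k} x_{jk}^2 - \sum_k N_k\bar{x}_{.k}^2$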

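To show that (9.4) is an identity, we write

$\sum_{j,k}(x_{jk} - \bar{x})^2 = \sum_{j,k}(x_{jk} - \bar{x}_{.k} + \bar{x}_{.k} - \bar{x})^2$

$\qquad = \sum_{j,k}(x_{jk} - \bar{x}_{.k})^2 + \sum_{j,k}(\bar{x}_{.k} - \bar{x})^2 + 2\sum_{j,k}(x_{jk} - \bar{x}_{.k})(\bar{x}_{.k} - \bar{x})$

The last term vanishes, since by (9.1)

$\sum_j (x_{jk} - \bar{x}_{.k}) = 0$

Moreover,

$\sum_{j,k}(\bar{x}_{.k} - \bar{x})^2 = \sum_k N_k\bar{x}_{.k}^2 - N\bar{x}^2$

$\qquad = \sum_k \big(\sum_j x_{jk}\big)^2/N_k - \big(\sum_{j,k} x_{jk}\big)^2/N$

whence the required result follows.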

The quantity $q_1$ measures the variability between families averaged over the
individuals in each family, while $q_2$ measures the remaining part of the
variability which cannot be explained by differences of treatment between families.
This part of the variability is commonly attributed to "error."

If the families are all the same size, so that $N_k = a$, say,

(9.6)    $q_1 = a\sum_k(\bar{x}_{.k} - \bar{x})^2$

The quadratic forms $Q$, $q_1$, $q_2$ are commonly called sums of squares (SS).
The name "squariance," on the analogy of variance, has been proposed by
Pitman and will often be used in this chapter. M. G. Kendall has suggested
the term "deviance."

The mathematical model underlying our treatment is as follows. The
total response $x_{jk}$ of the jth individual to the kth treatment is made up of an
overall effect $\mu$, a part $\beta_k$ characteristic of the kth treatment, and a part $\epsilon_{jk}$
which can be regarded as error. These parts are supposed to be additive, so
that

(9.7)    $x_{jk} = \mu + \beta_k + \epsilon_{jk}$

We can imagine $\mu$ adjusted so that $\sum_k N_k\beta_k = 0$. We assume also that each
$\epsilon_{jk}$ is an independent* random variate with expectation 0, independent not
only of the other $\epsilon$'s but also of the $\beta$'s. The hypothesis to be tested is that
the $\beta_k$ are all zero, or in other words that the treatments do not effectively
differ from each other.

* It is sufficient for some purposes to assume that the $\epsilon_{jk}$ are uncorrelated, but if they are
normally distributed zero covariances imply independence.

Often it is necessary to make special experimental arrangements to ensure
approximate independence of the $\epsilon_{jk}$. Thus the individual plots of land
receiving the same treatment are not side by side but randomly arranged
in blocks.

A further assumption we will make is that the $\epsilon_{jk}$ are normally distributed
about zero with a common variance $\sigma^2$. If so and if our hypothesis is true,
the $x_{jk}$ are independent normal variates and (see § 7.6) $\sum_{j,k}(x_{jk} - \bar{x})^2/\sigma^2$ is
distributed as $\chi^2$ with $\sum_k N_k - 1 = N - 1$ df. Moreover, for each value of k,
$\sum_j(x_{jk} - \bar{x}_{.k})^2/\sigma^2$ is distributed as $\chi^2$ with $N_k - 1$ df, and since the various
columns in Table 17 are independent, we can add the $\chi^2$ for these columns.
Hence $\sum_{j,k}(x_{jk} - \bar{x}_{.k})^2/\sigma^2$ is distributed as $\chi^2$ with $\sum_k(N_k - 1) = N - b$ df.

Since, therefore, $Q$ and $q_2$ in Equation (9.4) are distributed as $\chi^2\sigma^2$ with
$N - 1$ and $N - b$ df respectively, it follows from Fisher's Theorem (§ 5.4)
that $q_1$ is distributed as $\chi^2\sigma^2$ with $b - 1$ df, independently of $q_2$. Since the
expectation of $\chi^2$ with $\nu$ df is equal to $\nu$, it follows that $E[q_1/(b - 1)] = \sigma^2$,
$E[q_2/(N - b)] = \sigma^2$, and $E[Q/(N - 1)] = \sigma^2$, or in other words $q_1/(b - 1)$,
$q_2/(N - b)$, and $Q/(N - 1)$ are all unbiased estimates of $\sigma^2$. The first two
of these are independent. The ratio of these estimates is distributed as F
with $b - 1$ and $N - b$ df, and hence we have a convenient test for the
significance of any apparent treatment effects.

The results may be summarized in an analysis of variance table, such as 
Table 18. 


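Table 18

                     Sum of Squares (SS)                               Degrees of Freedom (df)    Mean Square (MS)
Between families     $q_1 = \sum_k N_k(\bar{x}_{.k} - \bar{x})^2$       $b - 1$                    $q_1/(b - 1)$
Within families      $q_2 = \sum_{j,k}(x_{jk} - \bar{x}_{.k})^2$        $N - b$                    $q_2/(N - b)$
Total                $Q = \sum_{j,k}(x_{jk} - \bar{x})^2$               $N - 1$                    $Q/(N - 1)$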

The first two columns are additive, but not the third. The name "mean
square" is given to the squariance divided by the degrees of freedom. All
these mean squares are estimates of the population variance.

Example 1. To test the effect of a small proportion of coal in the sand used for making
concrete, several batches were mixed under practically identical conditions except for the
variation in the percentage of coal. From each batch some cylinders were made and tested
for breaking strength (lb/in²). The results were

                        Percentage of Coal
           0        0.05       0.1        0.5        1.0
          1690      1550       1625       1725       1530
          1580      1445       1450       1550       1545
          1745      1645       1510       1430       1565
          1685      1545                  1445       1520
Mean      1675      1546.2     1528.3     1537.5     1540

One of the cylinders containing 0.1% was defective, so that there are only three
individuals in this family. We find $Q = \sum x^2 - (\sum x)^2/19$ = 165,918, $q_1$ = 59,257,
$q_2$ = 106,661.* The analysis of variance is, therefore,

                        SS         df        MS
Between families       59,257       4      14,814
Within families       106,661      14       7,619
Total                 165,918      18       9,218

F = 14,814/7619 = 1.94, with 4 and 14 df. The 5% point is 3.11 and the 1% point 5.03,
so that this value of F is clearly not significant. The admixture of coal does not, as far as
this experiment goes, affect the breaking strength.
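
The partition Q = q1 + q2 and the value of F in Example 1 can be reproduced with a few lines of code. This is a minimal sketch using the data of the example; the function name `one_way_anova` and the dictionary layout are choices made here for illustration, not part of the text.

```python
# Breaking strengths (lb/in^2) from Example 1, keyed by percentage of coal.
families = {
    0.0: [1690, 1580, 1745, 1685],
    0.05: [1550, 1445, 1645, 1545],
    0.1: [1625, 1450, 1510],          # one defective cylinder, so only three values
    0.5: [1725, 1550, 1430, 1445],
    1.0: [1530, 1545, 1565, 1520],
}


def one_way_anova(groups):
    """Return Q, q1 (between), q2 (within), their df and F for a one-way layout."""
    values = [x for g in groups for x in g]
    n_total = len(values)
    grand_mean = sum(values) / n_total
    q_total = sum((x - grand_mean) ** 2 for x in values)
    q_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    q_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (q_between / df_between) / (q_within / df_within)
    return {
        "Q": q_total, "q1": q_between, "q2": q_within,
        "df1": df_between, "df2": df_within, "F": f_ratio,
    }


anova = one_way_anova(list(families.values()))
print({name: round(value, 2) for name, value in anova.items()})
```

Because each sum of squares is computed separately from its definition, the agreement of `q1 + q2` with `Q` is a numerical check of identity (9.4) and not a consequence of how the code is written.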

There are three estimates of the variance of the parent population, given in the last 
column of the analysis of variance table. On the null hypothesis the third one is the most 
reliable, being based on the largest number of degrees of freedom, although we usually use 
the second, as it is valid even when the null hypothesis is not true. 

The standard error of the mean of samples of n is, therefore, estimated as $(7619/n)^{1/2}$.
The significance of the difference between two family means may be estimated by the t-test,
but it must be remembered that even if we really have several groups drawn at random
from the same population, the difference between the largest and smallest group means may
well appear significant as judged by a test appropriate only to two random samples.

* The physical dimensions of $q_1$ and $q_2$ (lb² in⁻⁴) are omitted for convenience. The
quantity F, being a ratio of two variances, has of course no dimensions.

9.2 Two-way Classification, with One Individual in Each Sub-class. Instead
of regarding the individuals in a column of Table 17 as mere replicates
of one another, subject only to random variation, we may be interested in a
possible significant variation between the individuals. We may have, for
example, b varieties of sugar beet, to be tested for sugar yield, these varieties
corresponding to the b treatments of the Table. If each variety is grown on
a plots of land, we have the possibility of distinguishing experimentally
between the variations in yield due to the difference between varieties and the
variations in yield due to soil factors (fertility, moisture, etc.). In order to
do so, we group the plots in blocks, in the simplest case b plots to a block, and
arrange that each variety is grown on one and only one plot in each block.
There may, and often will, be well-marked differences between blocks due to
gradients of soil fertility in the field, but if the different varieties are assigned
to plots within a block in a truly random manner the effect of block differences
can be separated completely from the effect of variety differences. The
ab (= N) values of the variate x are now classified into a rows (representing
blocks) and b columns (representing varieties).

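Let $x_{jk}$ be the value of x in the jth row and kth column. Let $\bar{x}_{j.}$ be the mean
of the jth row and $\bar{x}_{.k}$ the mean of the kth column, and let Q be the total
squariance. Now Q can be resolved into three quadratic forms as follows:

(9.8)    $Q = q_1 + q_2 + q_3$

where

$q_1 = b\sum_{j=1}^{a}(\bar{x}_{j.} - \bar{x})^2$

$q_2 = a\sum_{k=1}^{b}(\bar{x}_{.k} - \bar{x})^2$

$q_3 = \sum_{j=1}^{a}\sum_{k=1}^{b}(x_{jk} - \bar{x}_{j.} - \bar{x}_{.k} + \bar{x})^2$

That (9.8) is an identity in the N = ab values of x can be readily seen as
follows:

$\sum_{j=1}^{a}\sum_{k=1}^{b}(x_{jk} - \bar{x})^2 = \sum_{j=1}^{a}\sum_{k=1}^{b}\{(x_{jk} - \bar{x}_{j.} - \bar{x}_{.k} + \bar{x}) + (\bar{x}_{j.} - \bar{x}) + (\bar{x}_{.k} - \bar{x})\}^2$

$\qquad = \sum_{j=1}^{a}\sum_{k=1}^{b}(x_{jk} - \bar{x}_{j.} - \bar{x}_{.k} + \bar{x})^2 + \sum_{j=1}^{a}\sum_{k=1}^{b}(\bar{x}_{j.} - \bar{x})^2 + \sum_{j=1}^{a}\sum_{k=1}^{b}(\bar{x}_{.k} - \bar{x})^2$

To show that the cross-product terms vanish consider the term

$\sum_{j=1}^{a}\sum_{k=1}^{b}(x_{jk} - \bar{x}_{j.} - \bar{x}_{.k} + \bar{x})(\bar{x}_{j.} - \bar{x})$

This becomes

$\sum_{j=1}^{a}(\bar{x}_{j.} - \bar{x})\sum_{k=1}^{b}(x_{jk} - \bar{x}_{j.} - \bar{x}_{.k} + \bar{x}) = \sum_{j=1}^{a}(\bar{x}_{j.} - \bar{x})(b\bar{x}_{j.} - b\bar{x}_{j.} - b\bar{x} + b\bar{x}) = 0$

A similar demonstration can be made for the other cross-product terms.
This is left as an exercise for the student. Since

$\sum_{j=1}^{a}\sum_{k=1}^{b}(\bar{x}_{j.} - \bar{x})^2 = b\sum_{j=1}^{a}(\bar{x}_{j.} - \bar{x})^2$

$\sum_{j=1}^{a}\sum_{k=1}^{b}(\bar{x}_{.k} - \bar{x})^2 = a\sum_{k=1}^{b}(\bar{x}_{.k} - \bar{x})^2$

(9.8) is established.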

The variability between rows is measured by $q_1$ and between columns by $q_2$.
The residual variability, freed from the influence of either rows or columns, is
measured by $q_3$. On the assumption that the row effects and the column
effects are independent of each other, $q_3$ measures the "experimental error"
inherent in the experiment and over which no control is attempted. If blocks
are distinguished by rows and treatments by columns and if there is a
differential response of individuals in certain blocks to treatments, markedly
different from the average differential response over all blocks to the same
treatments, there is said to be interaction between blocks and treatments.
Such interaction, if it exists, will be lumped with the experimental error, and
the usual mathematical model ignores it.

The mathematical model is now

(9.9)    $x_{jk} = \mu + \alpha_j + \beta_k + \epsilon_{jk}$

where the part $\alpha_j$ is characteristic of the jth individual and $\beta_k$ of the kth
treatment, and where $\sum_j\alpha_j = \sum_k\beta_k = 0$.

If we suppose that the $\epsilon_{jk}$ are all normally and independently distributed about
zero with the same variance $\sigma^2$, and if all the $\alpha_j$ and $\beta_k$ are zero, then Q is
distributed as $\chi^2\sigma^2$ with $N - 1$ df. Also $\bar{x}_{j.}$ is normally distributed with
variance $\sigma^2/b$ so that $q_1$ is distributed as $\chi^2\sigma^2$ with $a - 1$ df, and similarly $q_2$
is so distributed, independently, with $b - 1$ df. It follows that $q_3$ is
distributed as $\chi^2\sigma^2$ with $N - a - b + 1 = (a - 1)(b - 1)$ df, independently of
$q_1$ and $q_2$. The analysis of variance is as shown in Table 19.


Table 19

Variance due to    df                  SS                                                                    MS
Rows               $a - 1$             $q_1 = b\sum_{j=1}^{a}(\bar{x}_{j.} - \bar{x})^2$                     $q_1/(a - 1)$
Columns            $b - 1$             $q_2 = a\sum_{k=1}^{b}(\bar{x}_{.k} - \bar{x})^2$                     $q_2/(b - 1)$
Interaction        $(a - 1)(b - 1)$    $q_3 = Q - q_1 - q_2$                                                 $q_3/\{(a - 1)(b - 1)\}$
Total              $ab - 1$            $Q = \sum_{j=1}^{a}\sum_{k=1}^{b}(x_{jk} - \bar{x})^2$

The quantities in the 4th column are all independent unbiased estimates of
$\sigma^2$. The quotient $Q/(ab - 1)$ is also an unbiased estimate but is not
independent of the others. Under the null hypothesis that there is no significant
variation between individuals, the quantity

(9.10)    $F = (b - 1)\,q_1/q_3$

has the F-distribution with $a - 1$ and $(a - 1)(b - 1)$ df. Under the null
hypothesis that there is no significant variation between treatments,

(9.11)    $F = (a - 1)\,q_2/q_3$

has the F-distribution with $b - 1$ and $(a - 1)(b - 1)$ df.

Example 2. In a feeding experiment a farmer has four types of hogs denoted by I, II,
III, IV. These types are each divided into three groups which are fed varietal rations A,
B, and C. The following results are obtained, the numbers in the table being the gains in
weight in pounds in the various groups.

              I        II       III      IV       Totals
A             7.0     16.0     10.5     13.5       47.0
B            14.0     15.5     15.0     21.0       65.5
C             8.5     16.5      9.5     13.5       48.0
Totals       29.5     48.0     35.0     48.0      160.5

The computations yield the following results:

              Sum of Squares     df     Unbiased Estimates
Rations           54.1250         2           27.06
Types             87.7292         3           29.24
Residual          28.2083         6            4.70

To test the significance of the variation in rations we refer F = 27.06/4.70 = 5.76 to
Snedecor's table where, corresponding to (2, 6) degrees of freedom, we find 5.14 for the
5% point and 10.92 for the 1% point. Similarly, to test the significance of the variation
between types we compute F = 29.24/4.70 = 6.2. The entries in the table for (3, 6) degrees
of freedom are 4.76 for the 5% point and 9.78 for the 1% point. Our conclusion is that there
is a significant difference between breeds and between varieties of rations at the 5% point,
but that neither is significant at the 1% point.
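
The three sums of squares of Example 2 follow directly from the definitions of q1, q2 and q3 in (9.8). The sketch below takes rations as rows (a = 3) and hog types as columns (b = 4); the function name `two_way_anova` is hypothetical and the code is meant only as a check on the arithmetic.

```python
# Gains in weight (lb) from Example 2: rows are rations, columns are hog types I-IV.
gains = {
    "A": [7.0, 16.0, 10.5, 13.5],
    "B": [14.0, 15.5, 15.0, 21.0],
    "C": [8.5, 16.5, 9.5, 13.5],
}


def two_way_anova(rows):
    """Split the total SS of an a x b table (one value per cell) into q1 + q2 + q3."""
    a, b = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (a * b)
    row_means = [sum(r) / b for r in rows]
    col_means = [sum(r[k] for r in rows) / a for k in range(b)]
    q1 = b * sum((m - grand) ** 2 for m in row_means)       # between rows
    q2 = a * sum((m - grand) ** 2 for m in col_means)       # between columns
    q3 = sum(
        (rows[j][k] - row_means[j] - col_means[k] + grand) ** 2
        for j in range(a) for k in range(b)
    )                                                        # residual
    q_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_residual = q3 / ((a - 1) * (b - 1))
    return {
        "q1": q1, "q2": q2, "q3": q3, "Q": q_total,
        "F_rows": (q1 / (a - 1)) / ms_residual,
        "F_cols": (q2 / (b - 1)) / ms_residual,
    }


result = two_way_anova(list(gains.values()))
print({name: round(value, 4) for name, value in result.items()})
```

Here `F_rows` is the ratio for rations, referred to (2, 6) df, and `F_cols` the ratio for types, referred to (3, 6) df.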

9.3 Interpretation of the Mean Square. Suppose that in the one-way
classification of § 9.1 the null hypothesis that the $\beta_k$ are all zero is untenable.
The true mean of the kth family is then $\mu + \beta_k$, with $\sum_k N_k\beta_k = 0$. Even if the
null hypothesis is not rejected at the chosen level of significance, we cannot be
sure that the $\beta_k$ are really all zero. The expected values of $q_1$ and of Q then
include terms depending on the $\beta_k$, although, as we shall see, the mean square
within families still provides a valid estimate of $\sigma^2$.

$E(q_1) = E\sum_k N_k\{\bar{x}_{.k} - \mu - \beta_k - (\bar{x} - \mu) + \beta_k\}^2$

$\qquad = E\sum_k N_k\{\bar{x}_{.k} - \mu - \beta_k - (\bar{x} - \mu)\}^2 + \sum_k N_k\beta_k^2$

Since $\bar{x}_{.k} - \mu - \beta_k$ has a mean $\bar{x} - \mu$ and variance $\sigma^2/N_k$, the expectation of
$\sum_k N_k\{\bar{x}_{.k} - \mu - \beta_k - (\bar{x} - \mu)\}^2$ is $(b - 1)\sigma^2$. Hence

$E(q_1) = (b - 1)\sigma^2 + \sum_k N_k\beta_k^2$

The expectation of the mean square between families is, therefore, always
greater than $\sigma^2$ unless all the $\beta_k$ vanish. In the same way

$E(Q) = E\sum_{j,k}\{x_{jk} - \mu - \beta_k - (\bar{x} - \mu)\}^2 + \sum_k N_k\beta_k^2$

$\qquad = (N - 1)\sigma^2 + \sum_k N_k\beta_k^2$

It follows that

$E(q_2) = (N - b)\sigma^2$

The null hypothesis is tested by finding the value of $F = [E(q_1)/E(q_2)] \times
[(N - b)/(b - 1)]$, which should be equal to 1 if all the $\beta_k$ vanish. Note that
in using this F-test we are testing the null hypothesis that F = 1 against the
alternative hypothesis that F is greater than 1. Hence, as already indicated
in § 7.16, the 5% point, for example, really does provide in this case a 5% level
of significance.

The assumption that the population variance is the same for all the families, 
even though their means differ, appears rather artificial. There are, however, 
many situations in which it may not be unreasonable. A fertilizer treatment, 
for example, may cause a marked change in the yield from each of a number of 
plots, without much affecting the variability from plot to plot. 

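In the two-way classification, given by

$x_{jk} = \mu + \alpha_j + \beta_k + \epsilon_{jk}$

where $\sum_j\alpha_j = \sum_k\beta_k = 0$,

$E(q_1) = bE\sum_j(\bar{x}_{j.} - \bar{x})^2$

$\qquad = bE\sum_j\{\bar{x}_{j.} - \mu - \alpha_j - (\bar{x} - \mu) + \alpha_j\}^2$

$\qquad = bE\sum_j\{\bar{x}_{j.} - \mu - \alpha_j - (\bar{x} - \mu)\}^2 + b\sum_j\alpha_j^2$

$\qquad = (a - 1)\sigma^2 + b\sum_j\alpha_j^2$

In the same way

$E(q_2) = (b - 1)\sigma^2 + a\sum_k\beta_k^2$

and

$E(Q) = (N - 1)\sigma^2 + a\sum_k\beta_k^2 + b\sum_j\alpha_j^2$

so that $E(q_3) = (a - 1)(b - 1)\sigma^2$. Hence it is still true that the residual
mean square provides an unbiased estimate of the population variance,
although the other mean squares contain additional components, as set out in
Table 20.

Table 20

                 Mean Square                   Expectation
Rows             $q_1/(a - 1)$                 $\sigma^2 + b\sum_j\alpha_j^2/(a - 1)$
Columns          $q_2/(b - 1)$                 $\sigma^2 + a\sum_k\beta_k^2/(b - 1)$
Interaction      $q_3/\{(a - 1)(b - 1)\}$      $\sigma^2$
(Residual)
Total            $Q/(N - 1)$                   $\sigma^2 + (a\sum_k\beta_k^2 + b\sum_j\alpha_j^2)/(N - 1)$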

9.4 Three-way Classification (One Member in Each Sub-class). Let us
suppose that our material is classified into A-, B-, and C-classes, in number
a, b, c, respectively, and that the value of the observed variate corresponding
to the sub-class $A_jB_kC_l$ is $x_{jkl}$. The mean value of x over the A-classes for
fixed $B_k$ and $C_l$ is $\bar{x}_{.kl}$, the mean value over the A- and B-classes for fixed $C_l$ is
$\bar{x}_{..l}$, and the general mean is $\bar{x}$ (written for short instead of $\bar{x}_{...}$). Then, on
the hypothesis of homogeneity and normality, we can prove by a direct
extension of the method already given for the simpler cases that the total sum of
squares may be split into 7 components ($7 = 2^3 - 1$) with degrees of freedom
as given in Table 21.

Table 21

Variation due to      SS         df
A                     $q_1$      $a - 1$
B                     $q_2$      $b - 1$
C                     $q_3$      $c - 1$
Interaction AB        $q_4$      $(a - 1)(b - 1)$
Interaction BC        $q_5$      $(b - 1)(c - 1)$
Interaction CA        $q_6$      $(c - 1)(a - 1)$
Residual ABC          $q_7$      $(a - 1)(b - 1)(c - 1)$
Total                 $Q$        $abc - 1$

The mean squares, given by dividing the SS by the df, are all, on the null
hypothesis, unbiased estimates of $\sigma^2$, and all except the last (for the total) are
independent. If the null hypothesis is rejected, the residual mean square
$q_7/\{(a - 1)(b - 1)(c - 1)\}$ is an unbiased estimate of $\sigma^2$ and so are the mean
squares $q_4/\{(a - 1)(b - 1)\}$, $q_5/\{(b - 1)(c - 1)\}$ and $q_6/\{(c - 1)(a - 1)\}$,
provided that we assume the additivity of the A-, B- and C-effects, as
represented by the equation

(9.12)    $x_{jkl} = \mu + \alpha_j + \beta_k + \gamma_l + \epsilon_{jkl}$

where as usual $\epsilon_{jkl}$ is supposed normally distributed about zero with
variance $\sigma^2$.

We can therefore use the F-test to see whether the ratios of these last three
estimates to the residual mean square differ significantly from unity. If all
the interactions are non-significant, the sums of squares and the degrees of
freedom may be pooled and the joint estimate of error used for testing the main
effects. If one or more of the interactions are significant the mathematical
model (9.12) must be modified.

The general model involving interactions may be written

$x_{jkl} = \mu + \alpha_j + \beta_k + \gamma_l + (\alpha\beta)_{jk} + (\alpha\gamma)_{jl} + (\beta\gamma)_{kl} + \epsilon_{jkl}$

where, for example, the term $(\alpha\beta)_{jk}$ represents the effect of the AB interaction
superimposed on the separate A and B effects, and where $\sum_j\alpha_j$, $\sum_k\beta_k$, $\sum_l\gamma_l$,
$\sum_j(\alpha\beta)_{jk}$, $\sum_k(\alpha\beta)_{jk}$, etc., are all zero.

The expectation of $q_4/\{(a - 1)(b - 1)\}$ may be shown, as in § 9.3, to be
$\sigma^2 + c\sum_{j,k}(\alpha\beta)_{jk}^2/\{(a - 1)(b - 1)\}$, and similarly for the other interaction
terms. The expectation of $q_1/(a - 1)$ is, however, $\sigma^2 + bc\sum_j\alpha_j^2/(a - 1)$, so
that the significance of the main effects may still be judged by comparison of
the mean squares for these effects with the residual mean square. We shall
see later that when the effects, including the interactions, are treated as
independent random variables this procedure is no longer correct. (See § 9.9.)

When there is interaction the quantity $q_4$ is no longer strictly distributed as
$\chi^2\sigma^2$. Instead,

$c\sum_{j,k}\{\bar{x}_{jk.} - \bar{x}_{j..} - \bar{x}_{.k.} + \bar{x} - (\alpha\beta)_{jk}\}^2$

is so distributed. The comparison of, say, $q_1$ with $q_4$ by the F-test is not
justified. However, this kind of test is commonly made, on the ground that,
if there are interaction effects, the main effects must be large compared with
these interaction effects if they are to have any practical importance.

Example 3. In an experiment at the Dominion Laboratory of Plant Pathology, Alberta,
to determine the effect on the growth of wheat when the seeds are buried for a time in ground
pitchblende (and so subjected to radiations) the variable measured was the average length
of shoot in millimeters for 25 plants. The seeds were planted in four replicate blocks, 14
plots to a block, the time of exposure varied from 1 to 7 days, and there was a complete set
of controls, treated in all respects like the experimental seeds except that they were not
irradiated. The A-classes are irradiated (R) and non-irradiated (N). The B-classes are
exposure times and the C-classes blocks. The results are shown in Table 22.

Table 22

Exposure                               Blocks                             Totals
 (days)    Treatment      I         II        III       IV
   1           R        136.7     106.1     110.0     135.0      487.8 }    905.4
               N         96.6      99.1     109.2     112.7      417.6 }
   2           R        140.9     131.4     142.7     154.3      569.3 }   1029.0
               N        117.8     142.1      92.1     107.7      459.7 }
   3           R        149.0     133.0     126.0     136.4      544.4 }   1048.9
               N        148.5     112.4     139.0     104.6      504.5 }
   4           R        165.2     152.3     167.0     145.7      630.2 }   1217.0
               N        131.7     160.8     145.6     148.7      586.8 }
   5           R        122.6     151.3      93.8     110.7      478.4 }   1009.6
               N         94.2     147.1     141.4     148.5      531.2 }
   6           R        161.2     147.0     158.7     150.2      617.1 }   1214.0
               N        137.9     149.7     145.8     163.5      596.9 }
   7           R        125.0     144.4     138.4     147.7      555.5 }   1034.5
               N        109.9     117.5     125.5     126.1      479.0 }
 Totals                1837.2    1894.2    1835.2    1891.8     7458.4     7458.4

The sum for all the R cells is 3882.7, and for all the N cells 3575.7. Here a = 2, b = 7,
c = 4, so that the total number of df is 55. We find $Q = \sum x_{jkl}^2 - C$ = 1,016,197 − C,
where C = (7458.4)²/56 = 993,352, so that Q = 22,845.

Since $\bar{x}_{1..} = 3882.7/28$ and $\bar{x}_{2..} = 3575.7/28$,

$q_1 = [(3882.7)^2 + (3575.7)^2]/28 - C = 1655$

and similarly,

$q_2 = [(905.4)^2 + \cdots + (1034.5)^2]/8 - C = 9541$

$q_3 = [(1837.2)^2 + \cdots + (1891.8)^2]/14 - C = 230.8$

The calculations of the interaction terms are a little more complicated. Thus $q_4$ can be
written

$q_4 = c\sum_{j,k}\bar{x}_{jk.}^2 - bc\sum_j\bar{x}_{j..}^2 - ac\sum_k\bar{x}_{.k.}^2 + abc\,\bar{x}^2$

$\quad = c\sum_{j,k}\bar{x}_{jk.}^2 - q_1 - q_2 - C$

$\quad = \tfrac{1}{4}[(487.8)^2 + \cdots + (479.0)^2] - 11{,}196 - C$

$\quad = 2029$

Similarly, since $\bar{x}_{.11} = (136.7 + 96.6)/2 = 233.3/2$, etc.,

$q_5 = a\sum_{k,l}\bar{x}_{.kl}^2 - q_2 - q_3 - C$

$\quad = \tfrac{1}{2}[(233.3)^2 + \cdots + (273.8)^2] - C - 9772$

$\quad = 4174$

and, since $\bar{x}_{1.1} = (136.7 + 140.9 + \cdots + 125.0)/7 = 1000.6/7$,

$q_6 = b\sum_{j,l}\bar{x}_{j.l}^2 - q_3 - q_1 - C$

$\quad = \tfrac{1}{7}[(1000.6)^2 + \cdots + (911.8)^2] - C - 1886$

$\quad = 798$

By subtraction, $q_7 = 4417$, so that the complete analysis of variance is as shown in Table 23.

Table 23

Variation due to        SS        df      MS
Treatments (A)         1,655       1     1655
Exposures (B)          9,541       6     1590
Blocks (C)               231       3       77
Interaction AB         2,029       6      338
Interaction BC         4,174      18      232
Interaction CA           798       3      266
Residual ABC           4,417      18      245
Total                 22,845      55

It is clear that none of the interactions is significant, so that they can all be lumped
together with the residual to give a mean square for error of 11,418/45 = 254, with 45 df.
Obviously there is no effect due to differences between the blocks. For the treatments,
F = 1655/254 = 6.52. Since for $n_1 = 1$ and $n_2 = 45$, the 5% point for F is 4.06 and the
1% point 7.23, there is apparently a significant treatment effect. For exposures,
F = 1590/254 = 6.26; with $n_1 = 6$ and $n_2 = 45$, the 5% point is 2.31, and the 1% point
3.23, so that the effect of length of exposure must be regarded as highly significant. That is
to say, the mere act of burying the seeds in powdered rock, whether radioactive or not, for
varying lengths of time apparently affected the length of shoot produced when the seeds were
planted out. This is unexpected, as is also the absence of significant interaction between
treatment and exposure. One would expect the effect of radiation, if significant at all, to
depend on the length of exposure.

It may be noted that, when we have several different estimates of a common variance, the 
chance that the largest ratio will be significant is considerably greater than would be given 
by the F-test. This test gives the probability that a random value of the ratio of two esti- 
mates will be exceeded. There is a danger of attributing significance to what, after all, may 
be just a sampling effect. 

9.5 Assumptions Made in Analysis of Variance. The assumptions underlying
the usual techniques of the analysis of variance may be summarized as:

1. Additivity of treatment effects and of environmental effects (such as the 
variation between blocks). 

2. Independence of all the experimental errors. 

3. Normality of the distribution of experimental errors. 

4. Constancy of the variance of the experimental errors, whatever the mag- 
nitude of the treatment or other effects. 

It is desirable to know whether the test will be seriously affected when these
assumptions do not apply. The assumptions, of course, are not equally
serious, but taken together they imply a severe restriction on the type of data
to which the techniques of the analysis of variance are strictly relevant. In
practice the techniques are applied very widely, so that the conclusions drawn
must usually be interpreted with caution.

Non-normality. Attempts to obtain the exact distribution of F, under the
null hypothesis, for random samples from a non-normal population have
encountered serious mathematical difficulties. E. S. Pearson obtained
empirical distributions for samples of 500 or 1000 from six selected non-normal
populations, but the samples were not large enough to fix the 5% points, let
alone the 1% points, with much accuracy. Nevertheless, from these and some
other experiments it appears that the ordinary F-test may be applied without
serious error to most types of distributions that are likely to occur. Cochran
suggests that a tabular 5% may mean anything between 4 and 7% and a
tabular 1% anything between ½ and 2%. As a rule the effect of non-normality
is to make results look more significant than they are.

Unless the data are very extensive, it is seldom possible to prove that they
are not normal. The standard errors of skewness and kurtosis are so large
that only very marked non-normality could be detected in a sample of
moderate size. If there is reason to suspect non-normality, from the nature
of the data, one may try one of the transformations mentioned in the next
section. These, although intended to stabilize the variance, do, as a rule,
improve the approximation to normality.

The analysis of variance test has been considered by Pitman, and
independently by Welch, from a different standpoint altogether, and in this form
the test involves no assumptions about the normality or otherwise of the parent
population. Their approach is as follows.

In the simple case of complete randomized blocks, with a blocks and b
treatments, the treatments are allocated at random among the b plots in a
block. If $x_{ij(k)}$ is the yield of the jth plot in the ith block (the kth treatment
being applied to this plot), the null hypothesis is that $x_{ij(k)}$ is independent of
k. This means that any one of the treatments would produce the same yield
on a particular plot, so that there is no treatment effect. The various yields
within each block might, therefore, be rearranged in all the b! possible
permutations and all these would on the null hypothesis be equally likely. The
whole set of yields actually obtained is regarded as one of the $(b!)^a$ possible and
equally likely rearrangements among the experimental plots. This is quite
different from the classical point of view, in which the yields in a given block
(on the null hypothesis regarding treatments) are considered as random
samples from a hypothetical infinite population of yields having a normal
distribution.

If we calculate the sums of squares and write

$S = S_b + S_t + S_e$

where S is the total SS and $S_b$, $S_t$, $S_e$ are the SS due to blocks, treatments,
and error respectively, then, on the ordinary theory, the ratio

$W = S_t/(S_t + S_e) = S_t/(S - S_b)$

has a beta-distribution with parameters $(b - 1)/2$, and $(a - 1)(b - 1)/2$.
This is equivalent to saying that $(a - 1)W/(1 - W)$ has the F-distribution.

From the Pitman and Welch point of view, the observed W is the result of
the chance allocation of treatments within a block. In the various
permutations that give rise to the different possible values of W the sum of squares
within blocks remains unchanged. Hence $S_t + S_e$ is constant, so that the
distribution of W is that of $S_t$.

It was shown by Pitman that W has the same expectation $a^{-1}$ as on the
ordinary theory, but its variance depends on the different within-block
variances. If all these variances are equal (as will be the case when the variates
are ranks), the variance of W is $2(a - 1)/\{a^3(b - 1)\}$. The variance of W
on the ordinary theory is $2(a - 1)/\{a^3(b - 1) + 2a^2\}$, which is not very
different if a and b are fairly large. The distribution of W is in fact
quite close to a beta-distribution with parameters $p = \tfrac{1}{2}(b - 1) - a^{-1}$,
$q = \tfrac{1}{2}(a - 1)(b - 1) - (a - 1)/a$.
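
When a and b are small the permutation distribution of W can be enumerated in full. The following sketch does so for a purely hypothetical set of yields (a = 3 blocks, b = 3 treatments; the numbers are invented for illustration and the function names are not from Pitman or Welch). It lists all $(b!)^a$ rearrangements and checks numerically that the mean of W is $a^{-1}$.

```python
from itertools import permutations, product


def w_statistic(blocks):
    """W = S_t / (S_t + S_e), where blocks[i][k] is the yield in block i under treatment k."""
    a, b = len(blocks), len(blocks[0])
    block_means = [sum(row) / b for row in blocks]
    grand = sum(block_means) / a
    treatment_means = [sum(row[k] for row in blocks) / a for k in range(b)]
    s_t = a * sum((m - grand) ** 2 for m in treatment_means)
    # S_t + S_e = S - S_b is the sum of squares within blocks.
    s_within = sum(
        (x - block_means[i]) ** 2 for i, row in enumerate(blocks) for x in row
    )
    return s_t / s_within


def permutation_distribution(blocks):
    """Values of W for every rearrangement of the yields within each block."""
    per_block = [list(permutations(row)) for row in blocks]
    return [w_statistic(arrangement) for arrangement in product(*per_block)]


# Hypothetical yields: 3 blocks (rows) by 3 treatments (columns).
yields = [
    [12.0, 15.0, 11.0],
    [20.0, 24.0, 19.0],
    [8.0, 13.0, 9.0],
]

w_values = permutation_distribution(yields)
w_observed = w_statistic(yields)
mean_w = sum(w_values) / len(w_values)
# Proportion of rearrangements giving a W at least as large as the one observed.
p_value = sum(w >= w_observed - 1e-12 for w in w_values) / len(w_values)
print(len(w_values), round(mean_w, 6), round(w_observed, 4), round(p_value, 4))
```

The proportion `p_value` plays the part of the significance level in this approach: it refers the observed W to its own permutation distribution and not to the beta-distribution of the ordinary theory.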

Correlation Between the Errors. Suppose that $\epsilon_i$ (i = 1, 2, ..., a) are the
errors of the individual observations on a single treatment, and that these have
a common variance $\sigma^2$ and a correlation coefficient $\rho$ between each pair. Then
$\mathrm{Var}(\sum_i\epsilon_i) = a\sigma^2 + a(a - 1)\rho\sigma^2$ (see Problem 20, Chapter IV), so that the
true variance of the treatment mean will be $\sigma^2[1 + (a - 1)\rho]/a$. The estimate
of this variance given by the usual method is equal to the sum of squares
within the family divided by $a(a - 1)$. This sum of squares is
$\sum_i(\epsilon_i - \bar{\epsilon})^2 = \sum_i\epsilon_i^2 - a\bar{\epsilon}^2$. Now the expectation of $\sum_i\epsilon_i^2$ is $a\sigma^2$ and the
expectation of $a\bar{\epsilon}^2$ is $\sigma^2[1 + (a - 1)\rho]$, so that the expectation of the analysis
of variance estimate is $\sigma^2(1 - \rho)/a$. This is less than the true value by $\rho\sigma^2$. Hence, if $\rho$ is
positive, the treatment mean is less accurate than it is estimated to be. If $\rho$
is negative, it is more accurate. In either case there is a bias in the estimation
of the variance. The actual situation is, of course, more complicated than in
this simple example, but the general effect of correlation is evident.
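
As a numerical illustration of the size of this bias, take a = 4 and $\rho$ = 0.2. These values are chosen here only to make the arithmetic concrete and do not come from any data in the text.

```latex
\begin{align*}
\text{true variance of the mean} &= \frac{\sigma^2[1 + (a-1)\rho]}{a}
    = \frac{\sigma^2(1 + 3 \times 0.2)}{4} = 0.40\,\sigma^2, \\
\text{expected value of the estimate} &= \frac{\sigma^2(1-\rho)}{a}
    = \frac{0.8\,\sigma^2}{4} = 0.20\,\sigma^2, \\
\text{difference} &= 0.40\,\sigma^2 - 0.20\,\sigma^2 = 0.20\,\sigma^2 = \rho\sigma^2 .
\end{align*}
```

With these values the usual estimate would on the average be only half the true variance of the treatment mean.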

Proper randomization will usually remove this difficulty. The plots corre- 
sponding to various treatments are laid out in blocks, for example, so that any 
one treatment is scattered at random over the blocks. Effects due to fer- 
tility gradients within the blocks are thus largely eliminated, and the errors 
may then be treated as though they were independent.

Non-additivity. A reasonable alternative to supposing the effects additive
is to take them as multiplicative. For example, in the two-way classification,
$x_{jk} = \mu\alpha_j\beta_k(1 + \epsilon_{jk})$. If we suppose that the $\epsilon_{jk}$ are all zero, the method of
analysis of variance described in § 9.2 will give an apparent error variance

$q_3/\{(a - 1)(b - 1)\} = \sum_{j,k}(x_{jk} - \bar{x}_{j.} - \bar{x}_{.k} + \bar{x})^2/\{(a - 1)(b - 1)\}$

$\qquad = \mu^2\sum_j(\alpha_j - \bar{\alpha})^2\sum_k(\beta_k - \bar{\beta})^2/\{(a - 1)(b - 1)\}$

When the $\epsilon_{jk}$ are not zero, the error variance will be increased by an amount
due to non-additivity. Unless the error variance is small or the treatment and
replication effects are both large this increase will generally be negligible.

If there is reason to suspect non-additivity, it may be well to transform the
variables as mentioned in the next section.

Non-uniformity of Variance. The effect of differences between the variances
will be to reduce the sensitivity of tests of significance and to increase the
uncertainty in the estimation of treatment effects. If a pooled error is used for
t-tests between two treatments, the result may be seriously in error.

If all the error variances are known, the observations may be weighted, 
each being given a weight inversely proportional to its error variance. In 
practice, the error variances are not known, but it is sometimes possible, when 
the data are obviously heterogeneous, to separate them into parts for each of 
which a variance may be estimated. 

Another situation arises when the variate x has a distribution (say of the 
binomial or Poisson type) for which the variance is a function of the expecta- 
tion. If there are real treatment effects, the variance will clearly not be con- 
stant as between treatments. The remedy in such cases is to make a suitable 
transformation of the variable. 

9.6 Transformations to Stabilize Variance. Suppose the original variate
$x$ has a distribution for which the expectation is $E(x) = m$ and the variance
is $\mathrm{Var}(x) = f(m)$, $f(m)$ not being a constant. We wish to find a new
variable $y$, a function of $x$, for which the variance will be independent (or
nearly independent) of $m$.

Let $y = \phi(x)$. Then for small variations of $x$ around $m$, we have
approximately, by Taylor's Theorem,

$y = \phi(m) + (x - m)\phi'(m)$

where $\phi'$ is the derivative of $\phi$. Since $E(x - m) = 0$, it follows that
$E(y) = \phi(m)$ and

$\mathrm{Var}(y) = E[(x - m)\phi'(m)]^2 = [\phi'(m)]^2\,\mathrm{Var}(x) = [\phi'(m)]^2 f(m)$

Hence, if $\mathrm{Var}(y) = c^2$, where $c$ is a constant,

$\phi'(m) = c/[f(m)]^{1/2}$

so that

(9.13)    $\phi(m) = c \int \frac{dm}{[f(m)]^{1/2}}$

If the distribution of $x$ is binomial, $x$ being, say, a proportion of successes in a
given number $s$ of trials, $m = \theta$ and $f(m) = \theta(1 - \theta)/s$. Therefore, ignoring
the constant,

(9.14)    $\phi(x) = \sin^{-1} x^{1/2}$

This is known as the angular transformation. A convenient table giving
the angles $\phi(x)$ for different values of $x$ (expressed as a percentage) is given by
Snedecor. The transformation works quite well except at the extreme values
of $x$. Bartlett has suggested that the ratios $0/s$ and $s/s$ should be counted
respectively as $1/(4s)$ and $(s - 1/4)/s$. This transformation improves the
approximation to normality, which is very poor for the binomial distribution
with moderate values of $s$ and $\theta$ near 0 or 1. The approximate variance is
$1/(4s)$ if $\phi(x)$ is in radians or $821/s$ if $\phi(x)$ is in degrees.

If the distribution of $x$ is of the Poisson type with expectation $m$, $f(m) = m$,
and $\phi(m) = 2c\,m^{1/2}$. Hence the transformation is $\phi(x) = x^{1/2}$. Bartlett has
shown that still better results are given by

(9.15)    $\phi(x) = (x + \tfrac{1}{2})^{1/2}$

where the $\tfrac{1}{2}$ is added as a sort of correction for continuity. This is usually
called the square root transformation. The variance is fairly constant for
$m > 3$, being approximately 0.25.

For many biological populations the standard deviation is approximately
proportional to the mean, that is, $f(m) = k^2 m^2$, so that

(9.16)    $\phi(x) = \log x$

This is the logarithmic transformation. It will convert multiplicative effects
into additive ones. The empirical transformation

(9.17)    $\phi(x) = \log (1 + x)$

avoids the difficulty of applying (9.16) when $x$ happens to be zero. The
variance is approximately $k^2$ or $0.189k^2$, according as the logarithms are to
base $e$ or to base 10.

The Fisher transformation (see § 8.15) may be included here. If $f(m) =
k(1 - m^2)^2$, then by (9.13), $\phi(m) = \tfrac{1}{2} \log [(1 + m)/(1 - m)]$. This is the
same as (8.70), as used for transforming sample correlation coefficients. The
transformation achieves approximate normality as well as approximate constancy
of variance. It is not, of course, needed very often in analysis of variance
problems. (See § 9.18.)
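The approximate variances quoted for (9.14) and (9.15) can be checked numerically. The sketch below is an illustration only, not part of the original text: the function names are hypothetical, and the exact variance of the transformed variate is obtained by summing over the binomial or Poisson probabilities. It should come out close to $1/(4s)$ and to 0.25 respectively.

```python
import math

def binomial_angular_variance(s, theta):
    """Exact variance (radians) of asin(sqrt(r/s)) for r ~ binomial(s, theta)."""
    mean = mean_sq = 0.0
    for r in range(s + 1):
        p = math.comb(s, r) * theta**r * (1 - theta) ** (s - r)
        y = math.asin(math.sqrt(r / s))
        mean += p * y
        mean_sq += p * y * y
    return mean_sq - mean**2

def poisson_sqrt_variance(m, shift=0.5, terms=400):
    """Variance of sqrt(x + shift) for Poisson x with mean m (truncated sum)."""
    p = math.exp(-m)          # P(x = 0)
    mean = mean_sq = 0.0
    for k in range(terms):
        y = math.sqrt(k + shift)
        mean += p * y
        mean_sq += p * y * y
        p *= m / (k + 1)      # recurrence P(k+1) = P(k) * m / (k+1)
    return mean_sq - mean**2

# Hypothetical illustrative values: s = 50 trials, theta = 0.3; Poisson mean 10.
v_binomial = binomial_angular_variance(50, 0.3)
v_poisson = poisson_sqrt_variance(10.0)
print(v_binomial, 1 / (4 * 50))
print(v_poisson, 0.25)
```

The agreement is only approximate, as the Taylor argument leading to (9.13) suggests; it degrades for $\theta$ near 0 or 1 and for small Poisson means.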


Example 4 (Bartlett). Table 24 gives the number of wheat seeds out of 50 which failed
to germinate under different treatments, each with 4 replications. Treatments 6 and 7 were
actually identical.

Table 24

                            Treatments
Replications     1     2     3     4     5     6     7
     1          10    11     8     9     7     6     9
     2           8    10     3     7     9     3    11
     3           5    11     2     8    10     7    11
     4           1     6     4    13     7    10    10
   Totals       24    38    17    37    33    26    41


The distribution being presumably binomial, but with $\theta$ possibly varying from treatment to
treatment, the angular transformation is indicated, namely, $y = \sin^{-1}(2x/100)^{1/2}$. The
values of $y$ (in degrees) are given in Table 24A.

Table 24A

                            Treatments
Replications     1      2      3      4      5      6      7
     1         26.6   28.0   23.6   25.1   22.0   20.3   25.1
     2         23.6   26.6   14.2   22.0   25.1   14.2   28.0
     3         18.4   28.0   11.5   23.6   26.6   22.0   28.0
     4          8.1   20.3   16.4   30.7   22.0   26.6   26.6


The analysis of variance in the new variable is

                         SS      df      MS
Between treatments     361.5      6     60.25
Within treatments      460.2     21     21.91
Total                  821.7     27

Hence $F = 2.75$. Since with $n_1 = 6$ and $n_2 = 21$, the 5% point is 2.57 and the 1% point
is 3.81, the value of $F$ appears barely significant. The estimate of variance 21.9 from the
MS within treatments is greater than the value $821/50 = 16.4$ which we should expect if
the variability were really binomial.
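The computation of Example 4 can be retraced in a few lines. This is a sketch under stated assumptions: the helper names `angular` and `one_way_anova` are hypothetical, and the angles are not rounded to one decimal before the analysis, so the sums of squares differ slightly from the printed table while the $F$ ratio agrees to about two decimals.

```python
import math

# Table 24: seeds (out of 50) failing to germinate.
# Rows are replications 1-4, columns are treatments 1-7.
counts = [
    [10, 11, 8, 9, 7, 6, 9],
    [8, 10, 3, 7, 9, 3, 11],
    [5, 11, 2, 8, 10, 7, 11],
    [1, 6, 4, 13, 7, 10, 10],
]

def angular(count, trials=50):
    """Angular transformation y = asin(sqrt(x)), in degrees, x = count/trials."""
    return math.degrees(math.asin(math.sqrt(count / trials)))

def one_way_anova(groups):
    """Return (SS between, SS within, F) for a list of samples."""
    all_values = [y for g in groups for y in g]
    grand = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_ratio

# One list of transformed values per treatment (a column of Table 24).
treatments = [[angular(row[j]) for row in counts] for j in range(7)]
ss_b, ss_w, f_ratio = one_way_anova(treatments)
print(round(ss_b, 1), round(ss_w, 1), round(f_ratio, 2))
```

The binomial benchmark $821/s$ for the error variance in degrees is $821/50 = 16.4$, which can be compared with `ss_w / 21`.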

9.7 Tests of Homogeneity of Variance. If we have sufficient degrees of
freedom to estimate the separate variances for different treatments with
reasonable accuracy, we can apply certain tests to determine whether the
hypothesis of a constant variance from treatment to treatment is acceptable.

Suppose we have $b$ samples, with numbers $N_1, \dots, N_b$ and squariances
$S_1, S_2, \dots, S_b$, each drawn from a normal population with unknown mean and
variance. The hypothesis to be tested is that these variances $\sigma_k^2$ ($k = 1,
2, \dots, b$) are in fact all equal to $\sigma^2$. On the null hypothesis the quantities
$S_k/\sigma^2$ are distributed independently as $\chi^2$ with $n_k$ ($= N_k - 1$) degrees of
freedom, and $S/\sigma^2 = \sum_k S_k/\sigma^2$ is distributed as $\chi^2$ with $n$ ($= \sum_k n_k$) degrees of
freedom. The joint probability density of the $S_k$ is, therefore,

(9.18)    $P = (2\sigma^2)^{-b} \prod_{k=1}^{b} f_k(S_k/2\sigma^2)$

where $f_k(x) = x^{(n_k/2)-1} e^{-x}/\Gamma(n_k/2)$, and the probability density of $S$ is

(9.19)    $P_0 = (2\sigma^2)^{-1} (S/2\sigma^2)^{(n/2)-1} e^{-S/2\sigma^2}/\Gamma(n/2)$
In order to test the homogeneity of the variances, we calculate the
conditional probability of getting the observed squariances $S_1, S_2, \dots, S_b$ if the
total of these squariances is fixed. That is, we determine $\Pr\{S_1, \dots, S_b \mid S\}$,
which is equal to $\Pr\{S_1, \dots, S_b\}/\Pr\{S\}$ or, apart from differentials, to $P/P_0$.
Now $P$ and $P_0$ contain $\sigma^2$. The ratios $S_k/\sigma^2$ and $S/\sigma^2$ will be independent of
the units in which the variable is measured, but this will not be true of the
remaining factor $(2\sigma^2)^{-(b-1)}$. If, however, we take as our variable $\log S_k$ instead
of $S_k$ this difficulty will be removed. If $y_k = \log S_k$, $dy_k = dS_k/S_k$, so that the joint
probability density of the $y_k$ is

$P' = \prod_k f_k'(S_k/2\sigma^2)$

where $f_k'(x) = x^{n_k/2} e^{-x}/\Gamma(n_k/2)$, and the probability density of $y = \log S$ is

$P_0' = (S/2\sigma^2)^{n/2} e^{-S/2\sigma^2}/\Gamma(n/2)$

Hence the conditional probability density is

(9.20)    $\frac{P'}{P_0'} = \frac{\Gamma(n/2)}{\prod_k \Gamma(n_k/2)} \cdot \frac{\prod_k (S_k/2\sigma^2)^{n_k/2}\, e^{-S_k/2\sigma^2}}{(S/2\sigma^2)^{n/2}\, e^{-S/2\sigma^2}}$

The likelihood is the logarithm of this, namely,

$L = \sum_k \tfrac{1}{2} n_k \log S_k - \sum_k S_k/2\sigma^2 - \tfrac{1}{2} n \log S + S/2\sigma^2 + C$
$\;\; = \sum_k \tfrac{1}{2} n_k \log S_k - \tfrac{1}{2} n \log S + C$

since $S = \sum_k S_k$ and $n = \sum_k n_k$. The maximum value of $L$ for different
possible values of $S_1, S_2, \dots, S_b$ is given by putting $\partial L/\partial S_k = 0$, $k = 1, 2, \dots, b$.
That is, $\tfrac{1}{2} n_k/S_k - n/2S = 0$, or $S_k/S = n_k/n$, which is equivalent to $S_k = c\,n_k$.
The maximum value of $L$ is, therefore,

$L_{\max} = \sum_k \tfrac{1}{2} n_k \log n_k - \tfrac{1}{2} n \log n + C$

If we take as our statistic to measure homogeneity of variance

(9.21)    $M = 2(L_{\max} - L) = -\sum_k n_k \log (S_k/n_k) + n \log (S/n)$

$M$ will be non-negative and independent of the unknown variance $\sigma^2$, and will
be a measure of the probability of not getting the observed set of squariances
$S_1, S_2, \dots, S_b$ for $b$ independent samples from a common population, under the
condition of a fixed value for the total $S = \sum_k S_k$. If $M$ is sufficiently large
the null hypothesis of a common $\sigma^2$ is rejected. The logarithms are to base $e$
and so are equal to the common logarithms multiplied by 2.303.

It may be noted that $S/n$ is a pooled estimate of the variance $\sigma^2$ based on
the total variation within samples, while the $S_k/n_k$ are separate estimates of
this variance based on the individual samples. If $K = e^{-M/n}$,

(9.22)    $K = \left\{ \prod_k (S_k/n_k)^{n_k} \right\}^{1/n} \Big/ \left\{ \sum_k n_k (S_k/n_k)/n \right\}$

which is a ratio of the weighted geometric mean of the $S_k/n_k$ to the weighted
arithmetic mean, the weights being the degrees of freedom $n_k$.

It is necessary in order to apply the test to know the distribution of $M$. It
was proved by Bartlett that $M$ is approximately distributed as $\chi^2$ with
$b - 1$ df, if the $n_k$ are fairly large. More exactly $M/c$ is so distributed,
where $c$ is a correction factor

(9.23)    $c = 1 + \frac{1}{3(b-1)} \left[ \sum_k (1/n_k) - 1/n \right]$

but even with this correction the approximation is not entirely satisfactory
when some of the $n_k$ are 3 or less.

Hartley has given a still better approximation, useful even when some of
the $n_k$ are down to 1 or 2. Tables based on this approximation have been
compiled by Catherine Thompson and M. Merrington. These tables give
5% and 1% points of the distribution of $M$, for given values of $b$ and $c_1$, where

(9.24)    $c_1 = \sum_k (1/n_k) - 1/n$

Each table gives two entries, denoted by (a) and (b). These correspond to
limiting values of a parameter $c_3 = \sum_k (1/n_k^3) - 1/n^3$. The true 5% (or 1%)
point will usually lie near (a), especially when the $n_k$ are nearly equal, but in
cases of doubt a separate table is provided to facilitate interpolation.

The criterion $L_1$ used by Neyman and Pearson was practically the same
as $K$ of (9.22), except that the sample numbers $N_k$ were used instead of the
degrees of freedom. Tables of $L_1$, for equal sample sizes, were calculated by
P. P. N. Nayer, and are reproduced in various books. These tables may also
be used when the $N_k$ vary, by using an average value, provided that none of
the $N_k$ is less than 15 or 20.

Example 5. For the data of Example 1, the estimated variances for the different sets of
concrete cylinders are given in column 4 of the following table:

Percentage Coal    n_k       S_k     S_k/n_k    n_k log10 (S_k/n_k)
     0              3      14,250     4,750          11.030
     0.05           3      20,019     6,673          11.473
     0.1            2      15,817     7,908           7.796
     0.5            3      55,425    18,475          12.800
     1.0            3       1,150       383           7.750
   Totals          14     106,661                    50.849

Using Bartlett's test, we have $M = 2.303\,[14 \log_{10} (106{,}661/14) - 50.849] = 8.056$ and
$c = 1 + (\tfrac{4}{3} + \tfrac{1}{2} - \tfrac{1}{14})/12 = 1.147$, so that $M/c = 7.02$. With 4 df this value of $\chi^2$
corresponds to a $P$ of about 0.13, so that the differences of variance, large as they appear at first
sight, are not significant in view of the small number of degrees of freedom.

Using Thompson and Merrington's tables, we arrive at the same result. With $c_1 = 1.762$
and $b = 5$, the 5% point for the distribution of $M$ is about 10.7 and the 1% point about 14.9,
so that the observed $M$ of 8.06 is not significant.

The degrees of freedom are too few for much reliance to be placed on Nayer's tables of $L_1$.
The value of $L_1$ is the ratio of the weighted geometric mean of the $S_k/N_k$ to the weighted
arithmetic mean, that is, 0.5698. The 5% point and 1% point corresponding to $N = 4$ are
0.491 and 0.370 respectively, again suggesting the non-significance of the observed $L_1$.
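The statistic $M$ of (9.21), Bartlett's correction factor $c$, and the ratio $K$ of (9.22) are easy to compute directly. The following is a minimal sketch, with a hypothetical function name, applied to the squariances and degrees of freedom of Example 5; natural logarithms are used, so no factor 2.303 is needed.

```python
import math

def bartlett_statistics(squariances, dfs):
    """Return (M, c, K) for squariances S_k with degrees of freedom n_k."""
    n = sum(dfs)
    s = sum(squariances)
    # (9.21): M = n log(S/n) - sum n_k log(S_k/n_k), natural logarithms
    m = n * math.log(s / n) - sum(
        nk * math.log(sk / nk) for sk, nk in zip(squariances, dfs)
    )
    # (9.24) and Bartlett's correction factor
    c1 = sum(1 / nk for nk in dfs) - 1 / n
    c = 1 + c1 / (3 * (len(dfs) - 1))
    # (9.22): weighted geometric mean over weighted arithmetic mean
    k = math.exp(-m / n)
    return m, c, k

# Data of Example 5 (concrete cylinders).
S = [14250, 20019, 15817, 55425, 1150]
n_k = [3, 3, 2, 3, 3]
M, c, K = bartlett_statistics(S, n_k)
print(round(M, 2), round(c, 3), round(M / c, 2))
```

The value of `M / c` is then referred to a $\chi^2$ table with $b - 1 = 4$ df, as in the example. When all the $S_k/n_k$ are equal, $M$ is zero and $K$ is 1.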


9.8 The Behrens-Fisher Test. The usual t-test for the difference between
the means of two random samples assumes that these samples come from
populations with a common variance. When the variance is different in the
two populations a modification of the test is necessary. Behrens, and later
Fisher, suggested a test which depends only on the sample means and variances.

Suppose we have two samples of $N_1$ ($= n_1 + 1$) and $N_2$ ($= n_2 + 1$) from
normal populations with means $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$. If $\bar{x}_1$, $\bar{x}_2$, $N_1 s_1^2$,
$N_2 s_2^2$ are the sample means and variances, the quantities $t_1 = (\bar{x}_1 - \mu_1)/s_1$ and
$t_2 = (\bar{x}_2 - \mu_2)/s_2$ are independently distributed as Student's $t$, with $n_1$ and
$n_2$ degrees of freedom respectively.

Let us define a quantity $d$ by

$d = [\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)]/(s_1^2 + s_2^2)^{1/2}$
$\;\; = (s_1 t_1 - s_2 t_2)/(s_1^2 + s_2^2)^{1/2}$

(9.25)    $\;\; = t_1 \sin \theta - t_2 \cos \theta$

where $\tan \theta = s_1/s_2$. Then $d$ clearly depends only on the difference of the
true means and on known quantities. It is independent of the unknown
population variances $\sigma_1^2$ and $\sigma_2^2$.

Now if $f_n(t)$ is the frequency function of the t-distribution with $n$ degrees of
freedom, the probability that the two samples will have values of $t_1$ and $t_2$,
respectively, lying within specified limits, will be

$\iint f_{n_1}(t_1)\, f_{n_2}(t_2)\, dt_1\, dt_2$

integrated over the appropriate region of the $t_1 t_2$ plane. Let the region be
specified by

$t_1 \sin \theta - t_2 \cos \theta > d_0$

and let us choose $d_0$ so that the integral is equal to a fixed quantity $\alpha/2$, say
0.025. Geometrically, this region is the part of the $t_1 t_2$ plane lying below the
line $L_1$, which has the equation

$t_2 = t_1 \tan \theta - d_0 \sec \theta$

The quantity $d_0$ is the perpendicular distance from the origin to $L_1$ (Figure 28).
Because of the symmetry of the distributions of $t_1$ and $t_2$, the region
corresponding to $d < -d_0$, which is the part of the plane above the line $L_2$
(parallel to $L_1$ and equidistant from the origin on the opposite side), will also
give an integral equal to $\alpha/2$. The probability, therefore, of the point $(t_1, t_2)$
lying between the lines $L_1$ and $L_2$ is $1 - \alpha$, so that

$\int_{-\infty}^{\infty} \int_{t_1 \tan \theta - d_0 \sec \theta}^{t_1 \tan \theta + d_0 \sec \theta} f_{n_2}(t_2)\, f_{n_1}(t_1)\, dt_2\, dt_1 = 1 - \alpha$

or, putting $u = t_2 - t_1 \tan \theta$,

(9.26)    $\int_{-\infty}^{\infty} \int_{-d_0 \sec \theta}^{d_0 \sec \theta} f_{n_2}(u + t_1 \tan \theta)\, du\, f_{n_1}(t_1)\, dt_1 = 1 - \alpha$

From (9.26), for fixed values of $\theta$, $n_2$ and $n_1$, we can calculate $d_0$ corresponding
to an assumed value of $\alpha$. This was done by Sukhatme. The relation
$d > d_0$ is equivalent to $\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2) > s d_0$, where $s = (s_1^2 + s_2^2)^{1/2}$,
and this may be written $\mu_1 - \mu_2 < \bar{x}_1 - \bar{x}_2 - s d_0$. Similarly the relation
$d < -d_0$ is equivalent to $\mu_1 - \mu_2 > \bar{x}_1 - \bar{x}_2 + s d_0$. The region between the
lines corresponds, therefore, to

(9.27)    $\bar{x}_1 - \bar{x}_2 - s d_0 < \mu_1 - \mu_2 < \bar{x}_1 - \bar{x}_2 + s d_0$

so that we can consider this relation as providing fiducial limits for $\mu_1 - \mu_2$
with confidence coefficient $1 - \alpha$. If $\bar{x}_1 - \bar{x}_2 > s d_0$, the probability* that
$\mu_1 - \mu_2 < 0$ will be less than $\alpha/2$, and if $\bar{x}_1 - \bar{x}_2 < -s d_0$, the probability that
$\mu_1 - \mu_2 > 0$ is less than $\alpha/2$. We may, therefore, regard $\mu_1$ and $\mu_2$ as
significantly different at the level $\alpha$ if $|\bar{x}_1 - \bar{x}_2| > s d_0$.

* This is a "fiducial probability" in Fisher's sense, not an a priori probability as used in
Bayes' Theorem. See the discussions in § 6.3 and in § 12.11.

Tables (due to Sukhatme) for the application of this test are given in
Statistical Tables by Fisher and Yates (Table V 1). These give the 5% and
1% points of $d_0$ for values of $\theta$ at 15° intervals and for $n_1$ and $n_2$ = 6, 8, 12, 24
and $\infty$.
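The double integral over the strip between the lines $L_1$ and $L_2$ can be evaluated numerically for a trial value of $d_0$. The sketch below is an illustration, not Sukhatme's method: it uses composite Simpson integration from the standard library only, the function names and the truncation of the outer integral at $\pm 30$ are assumptions, and it requires $0° < \theta < 90°$. For $d_0$ taken from the tables the result should be close to $1 - \alpha$.

```python
import math

def t_pdf(t, n):
    """Frequency function of Student's t with n degrees of freedom."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + t * t / n) ** (-(n + 1) / 2)

def simpson(f, a, b, m):
    """Composite Simpson rule on [a, b] with an even number m of subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def strip_probability(d0, theta_deg, n1, n2, limit=30.0):
    """Probability that |t1 sin(theta) - t2 cos(theta)| < d0."""
    theta = math.radians(theta_deg)
    tan_t, sec_t = math.tan(theta), 1 / math.cos(theta)

    def inner(t1):
        lo = t1 * tan_t - d0 * sec_t
        hi = t1 * tan_t + d0 * sec_t
        return t_pdf(t1, n1) * simpson(lambda t2: t_pdf(t2, n2), lo, hi, 200)

    return simpson(inner, -limit, limit, 800)

# Trial values: theta = 68 degrees, n1 = 11, n2 = 19, d0 = 2.179 (5% level).
p_between = strip_probability(2.179, 68, 11, 19)
print(round(p_between, 3))
```

Solving for $d_0$ at a given $\alpha$ would require a root-finding step around `strip_probability`, which is omitted here.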


Example 6. The mean of 12 observations of a certain quantity is 4.774, with a standard
error of 0.0094. The mean of 20 observations by a different method is 4.744, with a standard
error of 0.0038. Are these means significantly different?

The F-test for the ratio of the two variances gives

$F = \frac{12\,(0.0094)^2}{20\,(0.0038)^2} = 3.67$

The 5% point is 2.34 and the 1% point 3.36 so that the difference of variance is quite
definitely significant.

To apply the Behrens-Fisher test, we calculate $\theta = \tan^{-1} (94/38) = 68°$, $n_1 = 11$, $n_2 = 19$,
$s = 0.0101$. The nearest values of $n_1$ and $n_2$ in the tables are 8 and 12, 12 and 24 respectively,
and for each combination we must interpolate between $\theta = 60°$ and $75°$. The values of $d_0$
for the 5% level and $\theta = 68°$ are

              n_1 = 8    n_1 = 12
n_2 = 12       2.278      2.172
n_2 = 24       2.263      2.156

We now interpolate harmonically* for $n_1 = 11$, giving values

   n_2      d_0
   12      2.191
   24      2.175

A final interpolation for $n_2 = 19$ gives $d_0 = 2.179$, whence $s d_0 = 0.022$. Since $\bar{x}_1 - \bar{x}_2 = 0.030$,
the difference is significant at the 5% level. At the 1% level, we find in the same way that
$d_0 = 3.034$ and $s d_0 = 0.031$. Hence the observed value of $\bar{x}_1 - \bar{x}_2$ is practically significant
at the 1% level.
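Harmonic interpolation means interpolating linearly in the reciprocal of the degrees of freedom. A minimal sketch, using a hypothetical helper `harmonic_interp` and the tabulated 5% values quoted in Example 6:

```python
import math

def harmonic_interp(n, n_lo, v_lo, n_hi, v_hi):
    """Interpolate linearly in 1/n between (n_lo, v_lo) and (n_hi, v_hi)."""
    frac = (1 / n_lo - 1 / n) / (1 / n_lo - 1 / n_hi)
    return v_lo + (v_hi - v_lo) * frac

# Interpolate to n_1 = 11 between the columns n_1 = 8 and n_1 = 12.
d0_n2_12 = harmonic_interp(11, 8, 2.278, 12, 2.172)   # row n_2 = 12
d0_n2_24 = harmonic_interp(11, 8, 2.263, 12, 2.156)   # row n_2 = 24

# Then interpolate to n_2 = 19 between the rows n_2 = 12 and n_2 = 24.
d0 = harmonic_interp(19, 12, d0_n2_12, 24, d0_n2_24)

s = math.sqrt(0.0094**2 + 0.0038**2)   # standard error of the difference
print(round(d0_n2_12, 3), round(d0_n2_24, 3), round(s * d0, 3))
```

Because the intermediate values are not rounded to three decimals here, the final $d_0$ may differ from the printed 2.179 by a unit in the last place.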

If we applied the ordinary t-test, disregarding the differences in variance, we should get

$t = 0.030 \Big/ \left\{ \frac{32}{240} \cdot \frac{132\,(0.0094)^2 + 380\,(0.0038)^2}{30} \right\}^{1/2} = 3.44$

with 30 df. This corresponds to a probability of a little more than 0.001, so that the t-test
would here overestimate the significance.

There has been a good deal of discussion over the validity of the Behrens-Fisher
test. It is not true that (9.27) will hold in repeated sampling in a
proportion $1 - \alpha$ of trials, so that the "fiducial interval" is not strictly a
confidence interval. The justification for integrating (9.26) with $\theta$ constant is
not obvious, since for fixed values of $s_1$ and $s_2$, the quantities $t_1$ and $t_2$ have a
normal distribution and not that of Student. However, the test seems to
be valid on Fisher's theory of fiducial inference which is logically distinct from
the Neyman-Pearson theory of confidence intervals, although in many
problems the two theories give identical results. (See § 12.11.)

* Using the reciprocal of $n_1$ instead of $n_1$. Thus $2.278 - 0.106\,(\tfrac{1}{8} - \tfrac{1}{11})/(\tfrac{1}{8} - \tfrac{1}{12}) = 2.191$.

An approximate test for the difference of means of two populations with
different variances has been suggested by Cochran and Cox. A weighted
mean of critical $t$ values for the two samples, weighted with the respective
variances of the means, is calculated and compared with the observed
$t' = (\bar{x}_1 - \bar{x}_2)/(s_1^2 + s_2^2)^{1/2}$. (Note that $s_1$ and $s_2$ are the respective standard
errors of $\bar{x}_1$ and $\bar{x}_2$, so that $(s_1^2 + s_2^2)^{1/2}$ is the standard error of $\bar{x}_1 - \bar{x}_2$.)
For the data of Example 6, $t' = 0.030/0.0101 = 2.97$. For the first sample,
with $n_1 = 11$, the 5% point for $t$ is 2.201, and for the second sample, with
$n_2 = 19$, it is 2.093. The weighted $t_{.05}$ is $(2.201 s_1^2 + 2.093 s_2^2)/s^2 = 2.186$.
Similarly, the weighted $t_{.01}$ is 3.072. The observed $t'$ is, therefore, almost
significant at the 1% level, confirming the result of the Behrens-Fisher test.

If the two samples are equal in size, this method reduces to the ordinary
t-test for the difference of two means, but with the number of df equal to $n$
instead of $2n$.
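The Cochran and Cox calculation for Example 6 can be sketched as follows. The function name is hypothetical; the 5% critical values 2.201 and 2.093 are those quoted in the text, while the 1% values 3.106 and 2.861 are taken from a standard two-sided $t$ table for 11 and 19 df and are an assumption in the sense that the text does not print them.

```python
import math

def cochran_cox(mean_diff, se1, se2, t_crit1, t_crit2):
    """Return (observed t', weighted critical t) for two unequal-variance samples."""
    w1, w2 = se1**2, se2**2                  # variances of the two means
    t_obs = mean_diff / math.sqrt(w1 + w2)
    t_weighted = (t_crit1 * w1 + t_crit2 * w2) / (w1 + w2)
    return t_obs, t_weighted

diff = 4.774 - 4.744
t_obs, t05 = cochran_cox(diff, 0.0094, 0.0038, 2.201, 2.093)
_, t01 = cochran_cox(diff, 0.0094, 0.0038, 3.106, 2.861)
print(round(t_obs, 2), round(t05, 3), round(t01, 3))
```

The observed value comes out slightly below the printed 2.97 because the text rounds $s$ to 0.0101 before dividing.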

9.9 Estimation of Components of Variance. In the usual mathematical
model (9.7) underlying the one-way classification, the quantities $\beta_k$,
characteristic of the $k$th treatment, are supposed to be fixed parameters which we
desire to estimate. However, it is sometimes plausible to consider the $\beta_k$ as
values of a random variable having a normal distribution with mean zero and
variance $\sigma_b^2$, these values being independent of each other and of the $\epsilon_{jk}$. The
latter are supposed independently and normally distributed about zero with
variance $\sigma^2$.

Now, as we have seen in § 9.3, the $\beta_k$, as fixed parameters, can be estimated
by $\bar{x}_{.k} - \bar{x}$, and each estimate is independent of the others. In our new
model the $x_{jk}$ are normally distributed about $\mu$ with variance $\sigma^2 + \sigma_b^2$, but
since all the $x_{jk}$ with the same $k$ have a common component $\beta_k$ they are not
independent. In order to express the probability of occurrence of the actual
sample it is convenient to make an orthogonal transformation, as in § 4.13,
namely:

(9.28)
    $y_{1k} = (x_{1k} + \cdots + x_{ak})/a^{1/2} = a^{1/2}\,\bar{x}_{.k}$
    $y_{2k} = (x_{1k} - x_{2k})/2^{1/2}$
    $y_{3k} = (x_{1k} + x_{2k} - 2x_{3k})/6^{1/2}$
    $\cdots$
    $y_{ak} = [x_{1k} + \cdots + x_{a-1,k} - (a - 1)x_{ak}]/[a(a - 1)]^{1/2}$

Since $x_{1k} - x_{2k} = \epsilon_{1k} - \epsilon_{2k}$, $y_{2k}$ is normally distributed about zero with
variance $\sigma^2$, and so are $y_{3k}, \dots, y_{ak}$, and these are all independent of each other.
Moreover, $a^{-1/2} y_{1k} = \mu + \beta_k + \bar{\epsilon}_{.k}$, and so is normally distributed about
$\mu$ with variance $(\sigma^2 + a\sigma_b^2)/a$, independently of the other $y$'s. By the properties
of the orthogonal transformation, $\sum_{j=1}^{a} y_{jk}^2 = \sum_{j=1}^{a} x_{jk}^2$, so that

(9.29)    $\sum_{j=2}^{a} y_{jk}^2 = \sum_{j=1}^{a} x_{jk}^2 - a\,\bar{x}_{.k}^2 = \sum_{j=1}^{a} (x_{jk} - \bar{x}_{.k})^2$

which is the sum of squares within the $k$th family. The sum of squares within
families, $q_2$, is therefore equal to $\sum_k \sum_{j=2}^{a} y_{jk}^2$, and so is the sum of $b(a - 1)$
squares of independent normal variates with mean 0 and variance $\sigma^2$. Hence $q_2/\sigma^2$ is
distributed as $\chi^2$ with $b(a - 1)$ degrees of freedom, and

(9.30)    $E(m_2) = \sigma^2$

where $m_2 = q_2/[b(a - 1)]$ is the mean square within families.

Again, since $y_{1k} = a^{1/2}\,\bar{x}_{.k}$, and since the sum of squares between families is
given by $q_1 = a \sum_k (\bar{x}_{.k} - \bar{x})^2$, it follows that $q_1/(\sigma^2 + a\sigma_b^2)$ is distributed as
$\chi^2$ with $b - 1$ degrees of freedom. Hence, if $m_1 = q_1/(b - 1)$ is the mean
square between families,

(9.31)    $E(m_1) = \sigma^2 + a\sigma_b^2$

so that $\sigma^2$ is estimated by $m_2$ and $\sigma_b^2$ by $(m_1 - m_2)/a$.

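The simultaneous maximum likelihood estimates of the parameters are not
quite the same as these unbiased estimates, as we shall now show.

The probability of the set of independent $y$'s is given by

$P = (2\pi)^{-ab/2}\, \sigma^{-b(a-1)}\, (\sigma^2 + a\sigma_b^2)^{-b/2}\, e^{-K/2}$

where

$K = \frac{1}{\sigma^2} \sum_k \sum_{j=2}^{a} y_{jk}^2 + (\sigma^2 + a\sigma_b^2)^{-1} \sum_k (y_{1k} - a^{1/2}\mu)^2$
$\;\; = \frac{q_2}{\sigma^2} + a(\sigma^2 + a\sigma_b^2)^{-1} \sum_k (\bar{x}_{.k} - \bar{x} + \bar{x} - \mu)^2$
$\;\; = \frac{q_2}{\sigma^2} + [q_1 + ab(\bar{x} - \mu)^2]\,(\sigma^2 + a\sigma_b^2)^{-1}$

For convenience let $\theta_1 = \sigma^2$ and $\theta_2 = \sigma^2 + a\sigma_b^2$. Then $\theta_1$, $\theta_2$ and $\mu$ may be
regarded as the parameters of the distribution. If $L = \log P$, we have

$2L = C - b(a - 1) \log \theta_1 - b \log \theta_2 - K$
$\;\;\;\; = C - b(a - 1) \log \theta_1 - b \log \theta_2 - q_2/\theta_1 - q_1/\theta_2 - ab(\bar{x} - \mu)^2/\theta_2$

Putting $\partial L/\partial \mu$, $\partial L/\partial \theta_1$ and $\partial L/\partial \theta_2 = 0$, we get for the estimated values of $\mu$,
$\theta_1$, $\theta_2$ the equations

(9.32)
    $\bar{x} - \hat{\mu} = 0$
    $-\frac{b(a - 1)}{\hat{\theta}_1} + \frac{q_2}{\hat{\theta}_1^2} = 0$
    $-\frac{b}{\hat{\theta}_2} + \frac{q_1}{\hat{\theta}_2^2} + \frac{ab(\bar{x} - \hat{\mu})^2}{\hat{\theta}_2^2} = 0$

Hence

$\hat{\mu} = \bar{x}, \qquad \hat{\theta}_1 = \frac{q_2}{b(a - 1)} = m_2, \qquad \hat{\theta}_2 = \frac{q_1}{b} = \frac{(b - 1)m_1}{b}$

so that the maximum likelihood estimate of $\sigma^2 + a\sigma_b^2$ is biased.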
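The two sets of estimates can be compared on a small balanced layout. The sketch below is illustrative only: the function name and the three families of three observations are hypothetical, and the maximum likelihood value of $\sigma_b^2$ is obtained simply by substituting the estimates of $\theta_1$ and $\theta_2$ into $\theta_2 = \sigma^2 + a\sigma_b^2$, a step the text leaves implicit.

```python
def variance_components(families):
    """Estimates for b families of a observations each (balanced data)."""
    b = len(families)
    a = len(families[0])
    means = [sum(f) / a for f in families]
    grand = sum(means) / b
    q1 = a * sum((m - grand) ** 2 for m in means)            # between families
    q2 = sum((x - m) ** 2 for f, m in zip(families, means) for x in f)
    m1 = q1 / (b - 1)
    m2 = q2 / (b * (a - 1))
    return {
        "m1": m1,
        "m2": m2,
        "sigma_b2_unbiased": (m1 - m2) / a,   # from (9.30) and (9.31)
        "theta2_ml": q1 / b,                  # equals (b - 1) * m1 / b
        "sigma_b2_ml": (q1 / b - m2) / a,
    }

# Hypothetical data: b = 3 families, a = 3 members each.
est = variance_components([[1, 2, 3], [2, 4, 6], [5, 5, 8]])
print(est)
```

With few families the factor $(b - 1)/b$ makes the maximum likelihood estimate noticeably smaller than the unbiased one; either estimate of $\sigma_b^2$ can turn out negative when $m_1 < m_2$.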

In the two-way classification, in which the row, column and interaction effects
are all thought of as random variables, normally and independently
distributed, the mathematical model for the yield corresponding to the $i$th row,
the $j$th column and the $k$th replicate is

$x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$

where $E(\alpha_i) = E(\beta_j) = E(\gamma_{ij}) = E(\epsilon_{ijk}) = 0$ and the variances of $\alpha_i$, $\beta_j$, $\gamma_{ij}$
and $\epsilon_{ijk}$ are $\sigma_\alpha^2$, $\sigma_\beta^2$, $\sigma_\gamma^2$, $\sigma^2$ respectively. The variates $x_{ijk}$ are not independent,
the covariance of two $x$'s in the same row being, for example, $\sigma_\alpha^2$.

If the mean squares for rows, columns, interaction and replicates are denoted
by $m_1$, $m_2$, $m_3$, $m_4$ respectively and if there are $a$ rows, $b$ columns and $r$
replicates, the expectations of these mean squares are given by

$E(m_1) = \sigma^2 + r\sigma_\gamma^2 + br\sigma_\alpha^2$
$E(m_2) = \sigma^2 + r\sigma_\gamma^2 + ar\sigma_\beta^2$
$E(m_3) = \sigma^2 + r\sigma_\gamma^2$
$E(m_4) = \sigma^2$

On the null hypothesis that there is no interaction, $m_3/m_4$ has the F-distribution
with $(a - 1)(b - 1)$ and $ab(r - 1)$ df. If this ratio is not significantly
different from 1, the squariances and df of interaction and error may be
pooled to give a joint mean square with which $m_1$ and $m_2$ may be compared.
If the interaction is significant, the hypothesis of no row effect is tested by
putting $F = m_1/m_3$, with $a - 1$ and $(a - 1)(b - 1)$ df. Similarly, for the
column effect, $F = m_2/m_3$ with $b - 1$ and $(a - 1)(b - 1)$ df.

A difficulty arises in extending this method to a three-way classification,
even when we suppose for simplicity that there is only one member in each
sub-class. Suppose that we have $a$ A-classes, $b$ B-classes and $c$ C-classes.
If the three main effects, the three interactions and the triple interaction are
all thought of as independent random variables with variances $\sigma_\alpha^2$, $\sigma_\beta^2$, $\sigma_\gamma^2$,
$\sigma_{\alpha\beta}^2$, $\sigma_{\beta\gamma}^2$, $\sigma_{\gamma\alpha}^2$, $\sigma_{\alpha\beta\gamma}^2$ respectively, the expectations of the various mean squares
are given by

$E(m_A) = bc\sigma_\alpha^2 + b\sigma_{\gamma\alpha}^2 + c\sigma_{\alpha\beta}^2 + \sigma_{\alpha\beta\gamma}^2$
$E(m_B) = ca\sigma_\beta^2 + c\sigma_{\alpha\beta}^2 + a\sigma_{\beta\gamma}^2 + \sigma_{\alpha\beta\gamma}^2$
$E(m_C) = ab\sigma_\gamma^2 + a\sigma_{\beta\gamma}^2 + b\sigma_{\gamma\alpha}^2 + \sigma_{\alpha\beta\gamma}^2$
$E(m_{AB}) = c\sigma_{\alpha\beta}^2 + \sigma_{\alpha\beta\gamma}^2$
$E(m_{BC}) = a\sigma_{\beta\gamma}^2 + \sigma_{\alpha\beta\gamma}^2$
$E(m_{CA}) = b\sigma_{\gamma\alpha}^2 + \sigma_{\alpha\beta\gamma}^2$
$E(m_{ABC}) = \sigma_{\alpha\beta\gamma}^2$

Since there is no mean square between replicates, the triple interaction gives
the only available estimate of error. The significance of the three first-order
interactions may be estimated as usual, but it is evident from the above
equations that no straightforward F-test will serve to test the significance of the
main effects. Each one occurs in combination with two interactions.

It is easily verified that

$E(m_A - m_{AB} - m_{CA} + m_{ABC}) = bc\sigma_\alpha^2$

so that on the hypothesis that there are no real A-effects, which means that
$\sigma_\alpha^2 = 0$,

$E(m_A + m_{ABC}) = E(m_{AB} + m_{CA})$

We can therefore make an approximate test of this hypothesis by applying
the F-test to the ratio of $m_A + m_{ABC}$ to $m_{AB} + m_{CA}$. The difficulty is to
know what degrees of freedom to use. We estimate the df so that the variances
of the approximating distributions are the same as those of the actual
distributions, both for $m_A + m_{ABC}$ and for $m_{AB} + m_{CA}$.
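The variance of a $\chi^2$ distribution with $n_1$ df and expected value $p$ is $2p^2/n_1$.
Hence if $m_A + m_{ABC}$ has such a distribution, its variance must be
$(2/n_1)[bc\sigma_\alpha^2 + b\sigma_{\gamma\alpha}^2 + c\sigma_{\alpha\beta}^2 + 2\sigma_{\alpha\beta\gamma}^2]^2$. But the variance of $m_A$ is $[2/(a - 1)] \times
[bc\sigma_\alpha^2 + b\sigma_{\gamma\alpha}^2 + c\sigma_{\alpha\beta}^2 + \sigma_{\alpha\beta\gamma}^2]^2$, since it has a $\chi^2$ distribution with $a - 1$ df.
Similarly, the variance of $m_{ABC}$ is $2\sigma_{\alpha\beta\gamma}^4/[(a - 1)(b - 1)(c - 1)]$. The variance
of $m_A + m_{ABC}$ is the sum of the variances of $m_A$ and $m_{ABC}$, so that on
equating the two expressions for this variance we get

$n_1 = \frac{(bc\sigma_\alpha^2 + b\sigma_{\gamma\alpha}^2 + c\sigma_{\alpha\beta}^2 + 2\sigma_{\alpha\beta\gamma}^2)^2}{(bc\sigma_\alpha^2 + b\sigma_{\gamma\alpha}^2 + c\sigma_{\alpha\beta}^2 + \sigma_{\alpha\beta\gamma}^2)^2/(a - 1) + \sigma_{\alpha\beta\gamma}^4/[(a - 1)(b - 1)(c - 1)]}$

This is estimated by

$n_1 = \frac{(m_A + m_{ABC})^2}{m_A^2/(a - 1) + m_{ABC}^2/[(a - 1)(b - 1)(c - 1)]}$

In the same way $n_2$ is estimated by

$n_2 = \frac{(m_{AB} + m_{CA})^2}{m_{AB}^2/[(a - 1)(b - 1)] + m_{CA}^2/[(c - 1)(a - 1)]}$

The ratio $(m_A + m_{ABC})/(m_{AB} + m_{CA})$ is then approximately distributed as $F$
with $n_1$ and $n_2$ df, and serves to determine whether $\sigma_\alpha^2$ is significantly different
from zero.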

The working out of the tests for the other two main effects is left as an 
exercise. 
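The estimated degrees of freedom for a sum of independent mean squares follow one pattern: the square of the sum divided by the sum of each squared mean square over its own df. A minimal sketch with hypothetical names and hypothetical mean squares (no data for this test are given in the text):

```python
def approx_df(mean_squares, dfs):
    """Estimated df for a sum of independent mean squares with given df."""
    total = sum(mean_squares)
    return total**2 / sum(ms**2 / f for ms, f in zip(mean_squares, dfs))

def main_effect_a_test(m_a, m_ab, m_ca, m_abc, a, b, c):
    """Approximate F ratio and df for testing the A main effect."""
    n1 = approx_df([m_a, m_abc], [a - 1, (a - 1) * (b - 1) * (c - 1)])
    n2 = approx_df([m_ab, m_ca], [(a - 1) * (b - 1), (c - 1) * (a - 1)])
    f_ratio = (m_a + m_abc) / (m_ab + m_ca)
    return f_ratio, n1, n2

# Hypothetical mean squares for a 3 x 4 x 5 classification.
f_ratio, n1, n2 = main_effect_a_test(
    m_a=10.0, m_ab=3.0, m_ca=4.0, m_abc=2.0, a=3, b=4, c=5
)
print(round(f_ratio, 3), round(n1, 2), round(n2, 2))
```

The resulting $n_1$ and $n_2$ are generally fractional, so an F table has to be entered by interpolation or with the nearest whole numbers.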

9.10 Confidence Limits for the Component of Variance Due to Treatment
Effects. If we wish to assign confidence limits to $\sigma_b^2$ we must know the
distribution of the statistic which estimates it. If $v_1 = q_1/(\sigma^2 + a\sigma_b^2)$ and
$v_2 = q_2/\sigma^2$, then $v_1$ and $v_2$ are independent $\chi^2$ variables with $b - 1$ and
$b(a - 1)$ df respectively. Now

$u = \frac{m_1 - m_2}{a} = \frac{\sigma^2 + a\sigma_b^2}{a(b - 1)}\, v_1 - \frac{\sigma^2}{ab(a - 1)}\, v_2 = \lambda_1 v_1 - \lambda_2 v_2$

where $\lambda_1$ and $\lambda_2$ depend on the parameters $\sigma^2$ and $\sigma_b^2$. The joint frequency
function of $v_1$ and $v_2$ is proportional to $v_1^{(n_1/2)-1} v_2^{(n_2/2)-1} e^{-(v_1 + v_2)/2}$, where $n_1 = b - 1$,
$n_2 = b(a - 1)$. Putting $u = \lambda_1 v_1 - \lambda_2 v_2$, we obtain the joint frequency function of $u$ and
$v_2$, which is proportional to

$(u + \lambda_2 v_2)^{(n_1/2)-1}\, v_2^{(n_2/2)-1}\, \exp\{-[u + (\lambda_1 + \lambda_2)v_2]/2\lambda_1\}$

By integrating over the range of $v_2$, the distribution of $u$ is obtained. Since
$v_1$ goes from 0 to $\infty$, $v_2$ goes from $-u/\lambda_2$ to $\infty$ if $u$ is negative, but from 0 to $\infty$
if $u$ is positive. The frequency function for $u$ depends, however, on $\sigma^2$ as well
as on $\sigma_b^2$.

A parameter such as $\sigma^2$, which appears in a distribution that we want to
use to estimate another parameter, has been aptly termed by Hotelling a
nuisance parameter.

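The variance of $u$ is given by

$\mathrm{Var}(u) = \lambda_1^2\, \mathrm{Var}(v_1) + \lambda_2^2\, \mathrm{Var}(v_2)$

(9.33)    $\;\; = \frac{2}{a^2} \left[ \frac{(\sigma^2 + a\sigma_b^2)^2}{n_1} + \frac{\sigma^4}{n_2} \right]$

and to a first approximation, for large values of $b$ (say 60 or more) $u$ may be
regarded as normally distributed about $\sigma_b^2$ with this variance, which is, of
course, estimated by $(2/a^2)(m_1^2/n_1 + m_2^2/n_2)$.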

From the definition of $v_1$ and $v_2$ we can write $\sigma_b^2 = (1/a)(q_1/v_1 - q_2/v_2)$. On
Fisher's theory of fiducial inference, we could regard this relation as giving
fiducial limits for $\sigma_b^2$, for fixed values of $q_1$ and $q_2$. The quantities $v_1$ and $v_2$
are assumed to have the ordinary $\chi^2$ distributions with $n_1$ and $n_2$ degrees of
freedom, but here, as in the Behrens-Fisher test, we do not have a confidence
interval in the ordinary sense.

Another method, also based on Fisher's approach, is to let $F = m_1/m_2$. If
$F_\alpha$ and $F_\alpha'$ are the critical points of the ordinary F-distribution for $n_1$, $n_2$ and
for $n_1$, $\infty$ df respectively, we calculate $L = (F - F_\alpha)/(F F_\alpha' - F_\alpha)$. The
lower confidence limit for $\sigma_b^2$ is then taken as $(m_1 - m_2)L/a$ (zero if $F < F_\alpha$).
Similarly we calculate $U = (F - f_\alpha)/(F f_\alpha' - f_\alpha)$, where $f_\alpha(n_1, n_2)$ is the
reciprocal of $F_\alpha(n_2, n_1)$. The upper confidence limit is $(m_1 - m_2)U/a$.

Example 7. In an experiment on counting wireworms in soil (details slightly modified
for convenience) there were 25 plots and 6 samples from each. The mean square between
plots ($m_1$) was 72.96 and that within plots ($m_2$) was 38.44. The degrees of freedom were
$n_1 = 24$, $n_2 = 125$.

The hypothesis that the $\beta_k$ are all zero is rejected at about the 1% level, since $F = 1.90$
and the 1% point is 1.94. On the hypothesis that the $\beta_k$ are normally distributed, the
estimated $\sigma^2$ and $\sigma_b^2$ are 38.44 and 5.75 respectively. The variance of $u$ is

$\frac{2}{36} \left\{ \frac{(72.96)^2}{24} + \frac{(38.44)^2}{125} \right\} = 12.98$

so that the normal approximation would give the 90% confidence limits for $\sigma_b^2$ as 0 and 11.68.
Fisher's method gives

$\sigma_b^2 = \frac{24 \times 72.96}{6 v_1} - \frac{125 \times 38.44}{6 v_2}$

The 5% upper and lower points for $v_1$ (with $n_1 = 24$) are 36.415 and 13.848 respectively.
The corresponding points for $v_2$ ($n_2 = 125$) are 151.81 and 99.90 respectively, so that the
90% fiducial limits for $\sigma_b^2$ are 2.739 and 13.058.

For the third method we have $F_\alpha = 1.60$, $F_\alpha' = 1.52$, $F = 1.90$, so that $L = 0.233$, and
the lower confidence limit is 1.34. Similarly, $f_\alpha = (1.79)^{-1}$, $f_\alpha' = (1.73)^{-1}$ so that $U = 2.49$
and the upper confidence limit is 14.32. These values are probably the most reliable.
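The three sets of limits in Example 7 can be reproduced from the quoted mean squares and tabulated critical points. This is a sketch under stated assumptions: the function names are hypothetical, the critical points are copied from the example rather than computed, and 1.645 is the standard normal 5% one-sided point. The pairing of upper points of $v_1$ and $v_2$ for the lower fiducial limit is the one that reproduces the printed figures.

```python
import math

def normal_limits(m1, m2, a, n1, n2, z=1.645):
    """Normal approximation using the estimated variance (9.33) of u."""
    u = (m1 - m2) / a
    var_u = (2 / a**2) * (m1**2 / n1 + m2**2 / n2)
    half = z * math.sqrt(var_u)
    return max(0.0, u - half), u + half, var_u

def fiducial_limits(m1, m2, a, n1, n2, v1_hi, v1_lo, v2_hi, v2_lo):
    """Limits from sigma_b^2 = (q1/v1 - q2/v2)/a with tabulated chi-square points."""
    q1, q2 = n1 * m1, n2 * m2
    return (q1 / v1_hi - q2 / v2_hi) / a, (q1 / v1_lo - q2 / v2_lo) / a

def f_ratio_limits(m1, m2, a, F_a, F_a_inf, f_a, f_a_inf):
    """Limits (m1 - m2) L / a and (m1 - m2) U / a of the third method."""
    F = m1 / m2
    u = (m1 - m2) / a
    L = (F - F_a) / (F * F_a_inf - F_a) if F > F_a else 0.0
    U = (F - f_a) / (F * f_a_inf - f_a)
    return u * L, u * U

m1, m2, a, n1, n2 = 72.96, 38.44, 6, 24, 125
norm_lo, norm_hi, var_u = normal_limits(m1, m2, a, n1, n2)
fid_lo, fid_hi = fiducial_limits(m1, m2, a, n1, n2, 36.415, 13.848, 151.81, 99.90)
f_lo, f_hi = f_ratio_limits(m1, m2, a, 1.60, 1.52, 1 / 1.79, 1 / 1.73)
print(round(var_u, 2), (norm_lo, round(norm_hi, 2)))
print(round(fid_lo, 3), round(fid_hi, 3))
print(round(f_lo, 2), round(f_hi, 2))
```

Small differences from the printed limits of the third method arise because the text rounds $F$, $L$ and $U$ before multiplying.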


9.11 Effect of Unequal Numbers in the Sub-classes on the Estimation of
Treatment Effects. For the two-way classification of § 9.2, we can readily
estimate the class-effects $\alpha_j$, $\beta_k$, provided that all the sub-classes contain the same
number of entries.

The joint frequency function for the $x_{jk}$ is

$(2\pi\sigma^2)^{-ab/2} \exp \left\{ -\frac{1}{2\sigma^2} \sum_{jk} (x_{jk} - \alpha_j - \beta_k - \mu)^2 \right\}$

and the maximum likelihood estimates $a_j$, $b_k$, $m$ of $\alpha_j$, $\beta_k$, $\mu$ are given in the
usual way by

(9.34)
    $\sum_k (x_{jk} - a_j - b_k - m) = 0, \quad j = 1, 2, \dots, a$
    $\sum_j (x_{jk} - a_j - b_k - m) = 0, \quad k = 1, 2, \dots, b$
    $\sum_{jk} (x_{jk} - a_j - b_k - m) = 0$

Since $\sum_j a_j = \sum_k b_k = 0$, we have, on dividing these equations through by
$b$, $a$, $ab$, respectively,

(9.35)
    $\bar{x}_{j.} - a_j - m = 0$
    $\bar{x}_{.k} - b_k - m = 0$
    $\bar{x} - m = 0$

Hence $\mu$ is estimated by $\bar{x}$, $\alpha_j$ by $\bar{x}_{j.} - \bar{x}$, and $\beta_k$ by $\bar{x}_{.k} - \bar{x}$. These estimates
are all independent of each other, and the data are said to be orthogonal.

Suppose now that in the two-way classification into $a$ A-classes and $b$ B-classes,
the number of members in the sub-class $A_jB_k$ is $n_{jk}$ instead of 1 as we
assumed before. We can obtain an estimate of $\sigma^2$ by pooling the sums of
squares within the separate sub-classes. The number of degrees of freedom
will be $N - ab$, where $N = \sum_{jk} n_{jk}$. Let this estimate of variance be denoted
by $v$.

If we ignore the difference between the A and B classifications and regard
the data as a one-way classification into $ab$ classes, we can apply (9.7) and
obtain as the sum of squares between sub-classes

(9.36)    $q_1 = \sum_{jk} n_{jk} (\bar{x}_{jk} - \bar{x})^2$

where $\bar{x}_{jk}$ is the mean of the $n_{jk}$ members of the sub-class $A_jB_k$. Then
$q_1/(ab - 1)$ may be compared with $v$ as a test of homogeneity.

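Considering the A classification alone, let

(9.37)    $\tilde{x}_{j.} = \frac{1}{b} \sum_k \bar{x}_{jk}$

Since the variance of $\bar{x}_{jk}$ is $\sigma^2/n_{jk}$, we have, on the hypothesis that the $\alpha_j$ are
zero,

$\mathrm{Var}(\tilde{x}_{j.}) = \sigma^2 b^{-2} \sum_k (1/n_{jk}) = \sigma^2/N_j$

where

(9.38)    $N_j = b^2 \Big/ \sum_k (1/n_{jk})$

The mean $\tilde{x}_{j.}$ has, therefore, the same variance as if it were the mean of $N_j$
quantities with variance $\sigma^2$, so that $\tilde{x}_{j.}$ has a weight of $N_j$. If the weighted
mean of the $a$ quantities $\tilde{x}_{j.}$ is

(9.39)    $w_a = \sum_j N_j \tilde{x}_{j.} \Big/ \sum_j N_j$

it may be proved that $\sum_j N_j (\tilde{x}_{j.} - w_a)^2$ is distributed as $\chi^2\sigma^2$ with $a - 1$ df
so that an unbiased estimate of $\sigma^2$ is given by $Q_1/(a - 1)$, where

(9.40)    $Q_1 = \sum_j N_j \tilde{x}_{j.}^2 - w_a^2 \sum_j N_j$

Since this is independent of $v$, the A-effects may be tested by the ratio of
$Q_1/(a - 1)$ to $v$.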

Similarly for the B-effects, if 


( 9 . 41 ) 



an unbiased estimate of <r® is provided by Q^f (6 — 1), where 


(9.42) 

Qj = '^MkxY — wY^Mk 

and 

m 

(9.43) 

,,, vh = k/'^Mh 

k 

The B-effects are 

tested by the ratio of Q^/ (b ~ 1) to 
Let us now consider the question of estimating the parameters $\mu$, $\alpha_j$, $\beta_k$. If $x_{jkl}$ is the $l$th member of the sub-class $A_jB_k$, our model is

(9.44)    $x_{jkl} = \mu + \alpha_j + \beta_k + \epsilon_{jkl}$

where $j = 1, 2, \ldots, a$, $k = 1, 2, \ldots, b$, $l = 1, 2, \ldots, n_{jk}$ and where the $\epsilon_{jkl}$ are normally distributed about zero with variance $\sigma^2$. The usual procedure for maximum likelihood estimates gives

          $\sum_{kl} (x_{jkl} - a_j - b_k - m) = 0$,    $j = 1, 2, \ldots, a$

(9.45)    $\sum_{jl} (x_{jkl} - a_j - b_k - m) = 0$,    $k = 1, 2, \ldots, b$

          $\sum_{jkl} (x_{jkl} - a_j - b_k - m) = 0$

where $a_j$, $b_k$, $m$ are estimates of $\alpha_j$, $\beta_k$, $\mu$ respectively.

Let us write $N_{j.} = \sum_k n_{jk}$, $N_{.k} = \sum_j n_{jk}$, and let $\bar{x}_{jk}$ denote the mean of the sub-class $A_jB_k$. The equations become

          $\sum_k n_{jk}\bar{x}_{jk} - N_{j.}a_j - \sum_k n_{jk}b_k - N_{j.}m = 0$

(9.46)    $\sum_j n_{jk}\bar{x}_{jk} - \sum_j n_{jk}a_j - N_{.k}b_k - N_{.k}m = 0$

          $N\bar{x} - \sum_{jk} n_{jk}a_j - \sum_{jk} n_{jk}b_k - Nm = 0$

where $\bar{x}$ is the weighted mean of all the $\bar{x}_{jk}$, with weights $n_{jk}$. If we imagine the $a_j$ and $b_k$ so adjusted that $\sum_{jk} n_{jk}a_j = 0$ and $\sum_{jk} n_{jk}b_k = 0$, which merely means absorbing parts of the constants $\alpha_j$, $\beta_k$ into $\mu$, the quantities $\alpha_j$, $\beta_k$, $\mu$ are then estimated by $a_j$, $b_k$, $m$ as given by the set of $a + b + 1$ equations:

          $N_{j.}a_j + \sum_k n_{jk}b_k = \sum_k n_{jk}\bar{x}_{jk} - N_{j.}\bar{x}$

(9.47)    $\sum_j n_{jk}a_j + N_{.k}b_k = \sum_j n_{jk}\bar{x}_{jk} - N_{.k}\bar{x}$

          $m = \bar{x}$

          $j = 1, 2, \ldots, a$,    $k = 1, 2, \ldots, b$

If all the $n_{jk}$ are equal to 1, these reduce to (9.35), but in the general case the $\alpha$'s and $\beta$'s cannot be estimated independently. The data are not then orthogonal.

Example 8 [Gowen, quoted by Snedecor in (5)]. The data are mean lengths of life in days for three strains of mice, after inoculation with one of three isolations of typhoid bacillus. In each cell of Table 25 the number in brackets is the number of mice ($n_{jk}$) in the sub-class.

Table 25. Survival Time (in days) for Mice Inoculated with Typhoid Bacillus

                              Strain
Organism      I               II              III              x_j.       N_j        N_j x_j.
A           4.0000 (34)     4.0323 (31)     3.7576 (33)      3.9300     97.855      384.57
B           6.4545 (66)     6.7821 (78)     4.3097 (113)     5.8488    244.42      1429.56
C           6.6262 (107)    7.8045 (133)    4.1277 (188)     6.1861    405.70      2509.70
Sum                                                                    747.98      4323.83

x_.k        5.6936          6.2063          4.0650
M_k         166.95          171.11          202.37           (sum 540.43)
M_k x_.k    950.55          1061.96         822.63           (sum 2835.14)


There are 2 df between strains, 2 between organisms, 8 between sub-classes and 774 between individual mice. The mean square $v$ between individuals (from the original data, not obtainable from Table 25) is 5.015. From (9.36),

    $q_1 = \sum_{jk} n_{jk}\bar{x}_{jk}^2 - (\sum_{jk} n_{jk}\bar{x}_{jk})^2/N = 1785.6$

so that $q_1/(ab - 1) = 1785.6/8 = 223.2$. The data are obviously not homogeneous.

The computed numbers $N_j$ and $M_k$ are shown in Table 25. From the sums of columns 6 and 7, we get

    $w_a^2 \sum_j N_j = (4323.83)^2/747.98 = 24{,}994.66$

Also $\sum_j N_j x_{j.}^2 = 25{,}397.83$, so that $Q_1 = 403.17$. Since $a = 3$, the mean square between organisms is $Q_1/2 = 201.6$, which is highly significant. Similarly,

    $Q_2 = 15{,}346.88 - 14{,}873.37 = 473.51$

so that $Q_2/2 = 236.8$, which is the mean square between strains and is even more significant. The analysis of variance is therefore

Variation due to       SS        df       MS
Organisms             403.2       2      201.6
Strains               473.5       2      236.8
Mice                 3881.6     774        5.015



The equations for estimating the $\alpha_j$ and $\beta_k$ are

     98a_1 +  34b_1 +  31b_2 +  33b_3 =  385.00 -  98(5.5556) = -159.45
    257a_2 +  66b_1 +  78b_2 + 113b_3 = 1442.00 - 257(5.5556) =   14.22
    428a_3 + 107b_1 + 133b_2 + 188b_3 = 2523.00 - 428(5.5556) =  145.23
    207b_1 +  34a_1 +  66a_2 + 107a_3 = 1271.00 - 207(5.5556) =  121.00
    242b_2 +  31a_1 +  78a_2 + 133a_3 = 1692.00 - 242(5.5556) =  347.55
    334b_3 +  33a_1 + 113a_2 + 188a_3 = 1387.00 - 334(5.5556) = -468.55

We have also

     98a_1 + 257a_2 + 428a_3 = 0
    207b_1 + 242b_2 + 334b_3 = 0

The solution* is

    a_1 = -1.82356,    b_1 =  0.66749
    a_2 =  0.08753,    b_2 =  1.44095
    a_3 =  0.36499,    b_3 = -1.45773

with m = 5.5556.
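
The elimination can also be left to a short program. The following is a minimal sketch, not the author's procedure: it takes the sub-class numbers and the row and column totals quoted in the equations above, drops one row equation and one column equation (each is implied by the remaining equations together with the two constraints), and solves the resulting six equations by Gaussian elimination. The helper name `solve` and the variable names are assumptions made for illustration, and the mean is carried at full precision rather than rounded to 5.5556, so the last decimals may differ slightly from the printed solution.

```python
# Sub-class numbers n_jk (rows: organisms A, B, C; columns: strains I, II, III).
n = [[34, 31, 33],
     [66, 78, 113],
     [107, 133, 188]]
# Row and column totals of the observations, as quoted in the equations above.
row_tot = [385.0, 1442.0, 2523.0]
col_tot = [1271.0, 1692.0, 1387.0]


def solve(A, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    size = len(A)
    M = [list(map(float, row)) + [float(r)] for row, r in zip(A, rhs)]
    for c in range(size):
        p = max(range(c, size), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, size):
            f = M[r][c] / M[c][c]
            for q in range(c, size + 1):
                M[r][q] -= f * M[c][q]
    x = [0.0] * size
    for i in reversed(range(size)):
        s = sum(M[i][q] * x[q] for q in range(i + 1, size))
        x[i] = (M[i][size] - s) / M[i][i]
    return x


a_n, b_n = len(n), len(n[0])
N_row = [sum(r) for r in n]
N_col = [sum(n[j][k] for j in range(a_n)) for k in range(b_n)]
N = sum(N_row)
m = sum(row_tot) / N  # m equals the weighted general mean, by (9.47)

# Unknowns are ordered a_1..a_a, b_1..b_b.
eqs, rhs = [], []
for j in range(a_n - 1):          # row equations of (9.47), omitting the last
    eqs.append([N_row[j] if i == j else 0 for i in range(a_n)] + n[j])
    rhs.append(row_tot[j] - N_row[j] * m)
for k in range(b_n - 1):          # column equations of (9.47), omitting the last
    eqs.append([n[j][k] for j in range(a_n)]
               + [N_col[k] if i == k else 0 for i in range(b_n)])
    rhs.append(col_tot[k] - N_col[k] * m)
eqs.append(N_row + [0] * b_n)     # constraint: sum of N_j. a_j = 0
rhs.append(0.0)
eqs.append([0] * a_n + N_col)     # constraint: sum of N_.k b_k = 0
rhs.append(0.0)

sol = solve(eqs, rhs)
a_est, b_est = sol[:a_n], sol[a_n:]
fitted = [[m + a_est[j] + b_est[k] for k in range(b_n)] for j in range(a_n)]

print("m =", round(m, 4))
print("a =", [round(v, 5) for v in a_est])
print("b =", [round(v, 5) for v in b_est])
```

The list `fitted` holds the values $m + a_j + b_k$, that is, the estimated combined effects without the error terms.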

The estimated combined effects, without the error terms, are given by the following table:

                        Strain
Organism       I          II         III
A            4.3995                 2.2743
B                                   4.1854
C            6.5881     7.3615      4.4628


9.12 Interaction with Unequal Sub-class Numbers. In the general case with unequal sub-class numbers, it is not true that we can estimate interaction by subtracting from the total sum of squares the sums due to variation between rows and between columns. Because of the interdependence of the estimates, the "interaction" so calculated is not a valid estimate of $\sigma^2$. Thus we cannot calculate interaction for Example 8 as follows:

Table 26. Incorrect Analysis of Variance for Example 8

Variation               SS                       df            MS
Between organisms      403.2  \                   2  \        201.6
Between strains        473.5   } 1785.6           2   } 8     236.8
"Interaction"          908.9  /                   4  /        227.2
Residual              3881.6                    774             5.015
Total                 5667.2                    782


It is, however, possible to calculate a valid interaction term by deducting from the total sum of squares the amounts due to the fitting of the various constants, namely the $\alpha$'s, $\beta$'s and $\mu$. On the additive hypothesis, the remainder should be equal to the residual sum of squares. If not, the difference can be attributed to interaction.

* Found by systematic elimination of the variables. Express $a_1$, $a_2$, $a_3$ in terms of the $b$'s by equations 1, 2, 3, and substitute in equations 4, 5, 6. Eliminate $b_1$ by equation 8. Solve the resulting pair of equations for $b_2$ and $b_3$. For other methods see Chapter X.

By the basic hypotheses of analysis of variance, the sum

    $\sum_{jkl} (x_{jkl} - \alpha_j - \beta_k - \mu)^2$

is distributed as $\sigma^2\chi^2$ with $N$ degrees of freedom. The difference between an observed $x$ and its estimated value is $x_{jkl} - a_j - b_k - m$. The sum of squares of these discrepancies is

(9.48)    $\sum_{jkl} (x_{jkl} - a_j - b_k - m)^2 = \sum_{jkl} (x_{jkl} - \alpha_j - \beta_k - \mu)^2 - \left[ \sum_j N_{j.}(a_j - \alpha_j)^2 + \sum_k N_{.k}(b_k - \beta_k)^2 + N(m - \mu)^2 \right]$

By (9.46), the $a_j$, $b_k$ and $m$ are linear functions of the $x_{jkl}$. There are $a + b + 1$ of these quantities, subject to two constraints, namely,

(9.49)    $\sum_j N_{j.}a_j = 0$,    $\sum_k N_{.k}b_k = 0$

Hence we can express them in terms of $a + b - 1$ orthogonal normal variates (as in § 4.13), and the sum of squares will be distributed as $\sigma^2\chi^2$ with $a + b - 1$ df. This sum of squares will be equal to the sum of squares for the original variables, namely,

    $\sum_j N_{j.}(a_j - \alpha_j)^2 + \sum_k N_{.k}(b_k - \beta_k)^2 + N(m - \mu)^2$

Hence, from (9.48) it follows that $\sum_{jkl} (x_{jkl} - a_j - b_k - m)^2$ is independently distributed as $\sigma^2\chi^2$ with $N - a - b + 1$ df.

Now, by writing out the separate terms, we have

    $\sum_{jkl} (x_{jkl} - a_j - b_k - m)^2 = \sum_{jkl} x_{jkl}^2 + \sum_j N_{j.}a_j^2 + \sum_k N_{.k}b_k^2 + Nm^2$
        $+ 2\sum_{jk} a_j b_k n_{jk} + 2m\sum_{jk} a_j n_{jk} + 2m\sum_{jk} b_k n_{jk}$
        $- 2\sum_{jk} a_j n_{jk}\bar{x}_{jk} - 2\sum_{jk} b_k n_{jk}\bar{x}_{jk} - 2Nm\bar{x}$

    $= \sum_{jkl} x_{jkl}^2 + \sum_j a_j \left( N_{j.}a_j + \sum_k n_{jk}b_k - \sum_k n_{jk}\bar{x}_{jk} + N_{j.}\bar{x} \right)$
        $+ \sum_k b_k \left( N_{.k}b_k + \sum_j n_{jk}a_j - \sum_j n_{jk}\bar{x}_{jk} + N_{.k}\bar{x} \right)$
        $+ 2Nm(m - \bar{x}) - Nm^2 - \sum_{jk} a_j n_{jk}\bar{x}_{jk} - \sum_{jk} b_k n_{jk}\bar{x}_{jk}$
        $- \sum_j a_j N_{j.}(\bar{x} - 2m) - \sum_k b_k N_{.k}(\bar{x} - 2m)$

By (9.47), the expressions in parentheses in the second, third and fourth terms on the right-hand side vanish. Also the last two terms vanish by (9.49). We have left

(9.50)    $\sum_{jkl} (x_{jkl} - a_j - b_k - m)^2 = \sum_{jkl} x_{jkl}^2 - Nm^2 - \sum_{jk} a_j n_{jk}\bar{x}_{jk} - \sum_{jk} b_k n_{jk}\bar{x}_{jk}$

where the first two terms on the right give the total sum of squares, with $N - 1$ df, and the other two terms the reduction due to the fitting of the constants, with $a + b - 2$ df.

We have already seen that the sum of squares due to residuals ($v$) can be split off from the total, leaving an amount ($q_1$) with $ab - 1$ df which, on the hypothesis that the $\alpha_j$ and $\beta_k$ are all zero, is an independent estimate of $\sigma^2$. If we subtract from $q_1$ the sum of squares due to fitting the constants, the remainder with $(a - 1)(b - 1)$ df will also be an independent estimate of $\sigma^2$, and may be used to test interaction.

In the foregoing example, $\sum_{jk} a_j n_{jk}\bar{x}_{jk} = 345.0$, $\sum_{jk} b_k n_{jk}\bar{x}_{jk} = 1264.6$, so that the sum of squares due to interaction is 1785.6 - 345.0 - 1264.6 = 176.0. The correct analysis of variance is therefore as shown in Table 26A.

Table 26A. Correct Analysis of Variance for Example 8

Variation          SS        df       MS
Organisms         403.2       2      201.6
Strains           473.5       2      236.8
Interaction       176.0       4       44.0
Mice             3881.6     774        5.015


There is clearly a definite interaction effect here, although not as great as 
would have been deduced from Table 26. 
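
The arithmetic behind the interaction line can be retraced in a few lines. This is a sketch only, using the cell means and sub-class numbers of Table 25 and the estimates $a_j$, $b_k$ quoted in Example 8; the variable names are illustrative, and small differences in the last digit against the printed figures are to be expected because the cell means are rounded to four decimals.

```python
# Sub-class numbers and cell means from Table 25 (rows: organisms; columns: strains).
n = [[34, 31, 33], [66, 78, 113], [107, 133, 188]]
xbar = [[4.0000, 4.0323, 3.7576],
        [6.4545, 6.7821, 4.3097],
        [6.6262, 7.8045, 4.1277]]
# Estimates of the constants as quoted in Example 8.
a_est = [-1.82356, 0.08753, 0.36499]
b_est = [0.66749, 1.44095, -1.45773]

cells = [(j, k) for j in range(3) for k in range(3)]
N = sum(n[j][k] for j, k in cells)
total = sum(n[j][k] * xbar[j][k] for j, k in cells)

# Sum of squares between sub-classes, q1, with ab - 1 = 8 df.
q1 = sum(n[j][k] * xbar[j][k] ** 2 for j, k in cells) - total ** 2 / N

# Reduction due to fitting the constants: the last two terms of (9.50).
red_a = sum(a_est[j] * n[j][k] * xbar[j][k] for j, k in cells)
red_b = sum(b_est[k] * n[j][k] * xbar[j][k] for j, k in cells)

interaction = q1 - red_a - red_b
ms_interaction = interaction / ((3 - 1) * (3 - 1))

print(round(q1, 1), round(red_a, 1), round(red_b, 1))
print(round(interaction, 1), round(ms_interaction, 1))
```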

The organism effect is tested by the ratio 201.6/44.0 and not by 201.6/5.015. 
The latter would be correct if we intended our inference to apply only to the 
same strains of mice as used in this experiment. 

9.13 Proportional Sub-class Numbers. If it happens that the numbers $n_{jk}$, although unequal, are proportional in the different rows (and therefore also in the different columns) the data are still orthogonal. In this case every $n_{jk}$ can be written as the product of $l_j$ and $m_k$, where $l_j$ is a number characteristic of the $j$th row and $m_k$ is a number characteristic of the $k$th column. The $\alpha_j$ and $\beta_k$ can be estimated separately from the means of the A-classes and the B-classes respectively. The sum of squares between A-means can be calculated as in (9.6), and similarly for the sum of squares between B-means. The difference between the sum of squares between classes and the total of these A- and B-sums is attributable to interaction.
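
Whether a given table of sub-class numbers is proportional can be checked mechanically: $n_{jk} = l_j m_k$ for all cells exactly when $n_{jk} N = N_{j.} N_{.k}$. The sketch below uses that criterion; the function name and the first table of counts are hypothetical, while the second table is the set of sub-class numbers from Example 8.

```python
def is_proportional(n):
    """Return True if every n_jk equals (row total)(column total)/(grand total)."""
    row = [sum(r) for r in n]
    col = [sum(c) for c in zip(*n)]
    total = sum(row)
    return all(n[j][k] * total == row[j] * col[k]
               for j in range(len(n)) for k in range(len(n[0])))


# Hypothetical counts with rows in the ratio 2 : 3.
print(is_proportional([[2, 4, 6], [3, 6, 9]]))
# Sub-class numbers of Example 8.
print(is_proportional([[34, 31, 33], [66, 78, 113], [107, 133, 188]]))
```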

9.14 The Missing Plot Technique. It is evident from the analysis in §§ 9.11 and 9.12 that the computations are very much simplified when there is only one individual in each sub-class. If so, however, there is the risk of losing by accident, such as the death of an experimental animal or the ravages of an insect pest on a crop-plant, all the information available about some particular sub-class. If this happens to a few sub-classes the situation is not hopeless. It is possible to form an estimate of the missing values and to carry out the analysis with only a minor loss in precision.

On the usual mathematical model, the $\alpha_j$, $\beta_k$ and $\mu$ are determined so as to minimize the residual sum of squares $S_e$ attributable to error. If some of the $x_{jk}$ are missing, we replace them by estimates $X_{jk}$, regarded as unknowns, and minimize $S_e$ with respect to these unknowns.

If for convenience we think of a randomized block experiment with $a$ varieties or treatments and $b$ blocks, we can denote the treatment totals by $T_j$, the block totals by $B_k$ and the grand total by $G$. The residual sum of squares is then given by

(9.51)    $S_e = \sum_{jk} x_{jk}^2 - \frac{1}{b}\sum_j T_j^2 - \frac{1}{a}\sum_k B_k^2 + \frac{G^2}{ab}$

Assuming that the $jk$th value is missing, we have, on differentiating with respect to $X_{jk}$ and putting the derivative equal to zero,

(9.52)    $X_{jk} - B_k/a - T_j/b + G/ab = 0$

In (9.52) the quantities $B_k$, $T_j$, $G$ all include the unknown value $X_{jk}$. If $B_k'$, $T_j'$, $G'$ represent the totals, excluding $X_{jk}$, we have

    $X_{jk}(1 - 1/a - 1/b + 1/ab) = B_k'/a + T_j'/b - G'/ab$

or

(9.53)    $X_{jk} = \dfrac{bB_k' + aT_j' - G'}{(a - 1)(b - 1)}$

If there is only one missing value in the whole table, (9.53) will give the required estimate. If there are two or more, this equation, applied to each value, will give a set of simultaneous equations for the unknowns.


Example 9. Assume that the table represents yields for 5 varieties, each planted in 5 blocks. There are two missing yields, represented by $x$ and $y$.

                               Blocks
Variety      1           2        3        4        5           Totals
A           9.5         4.0      6.5      4.9      9.3          34.2
B           x           6.2      6.0      7.6      7.6          27.4 + x
C          11.8         9.3     15.4     13.2     15.9          65.6
D           6.4         5.4      7.6      8.6      y            28.0 + y
E           3.3         5.1      4.6      6.3      6.3          25.6
Totals     31.0 + x    30.0     40.1     40.6     39.1 + y     180.8 + x + y

For the yield $x$, $B_k' = 31.0$, $T_j' = 27.4$, $G' = 180.8 + y$, and for the yield $y$, $B_k' = 39.1$, $T_j' = 28.0$, $G' = 180.8 + x$. Hence, from (9.53),

    x = (165.0 + 137.0 - 180.8 - y)/16
    y = (195.5 + 140.0 - 180.8 - x)/16

or

    16x + y = 121.2
    x + 16y = 154.7

whence x = 7.00, y = 9.23. These are the estimated yields.
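
Formula (9.53) lends itself to simple iteration when two values are missing. The sketch below is an illustration under stated assumptions, not the book's worked solution: the function name `missing_value` is hypothetical, and the totals are those read from the table of Example 9. Note that with the block total 31.0 as tabulated, the term $bB_k'$ for $x$ is 5 × 31.0 = 155.0, whereas the printed equations use 165.0; the sketch therefore arrives at values near x = 6.37, y = 9.27 rather than the printed 7.00 and 9.23, and the reader may wish to check the figures against the original data.

```python
def missing_value(a, b, B, T, G):
    """Estimate a missing yield by (9.53).

    a treatments, b blocks; B, T, G are the block, treatment and grand
    totals with the missing value itself excluded.
    """
    return (b * B + a * T - G) / ((a - 1) * (b - 1))


a = b = 5
x = y = 0.0
# Apply (9.53) alternately to each unknown until the pair settles down;
# each pass shrinks the error by a factor of 16 * 16, so few passes are needed.
for _ in range(100):
    x = missing_value(a, b, B=31.0, T=27.4, G=180.8 + y)
    y = missing_value(a, b, B=39.1, T=28.0, G=180.8 + x)

print(round(x, 2), round(y, 2))
```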

The analysis of variance may now be carried out as usual, with the estimated values substituted for $x$ and $y$, but with 2 df subtracted for both total and error sum of squares. This procedure is not, however, strictly correct. It introduces an upward bias into the mean square between varieties which tends to exaggerate the significance of varietal differences. To see this, let us consider the hypothesis that there is no difference between varieties, so that all the $\alpha_j$ are zero. The sum of squares between varieties will then be lumped with the error; the total may be called the conditional error. If we minimize the conditional error $S_c$ instead of the residual $S_e$ with respect to the missing values we shall get a different set of values for the unknowns from the set given by (9.53). We have, in fact,

(9.54)    $S_c = \sum_{jk} x_{jk}^2 - \frac{1}{a}\sum_k B_k^2$

and on putting $\partial S_c/\partial X_{jk} = 0$, we get for the new unknown $X_{jk}'$

    $X_{jk}' - (B_k' + X_{jk}')/a = 0$

Therefore

(9.55)    $X_{jk}' = B_k'/(a - 1) = c_k$, say

The bias in the sum of squares between varieties is due to using (9.53) instead of (9.55) in testing the null hypothesis, although the former should be used to obtain unbiased estimates of the unknowns. The value of $S_c$ using (9.53) is too great by an amount

    $X_{jk}^2 - \frac{1}{a}(X_{jk} + B_k')^2 - c_k^2 + \frac{1}{a}(c_k + B_k')^2 = \frac{a-1}{a}(X_{jk}^2 - c_k^2) - \frac{2B_k'}{a}(X_{jk} - c_k)$

    $= \frac{a-1}{a}(X_{jk} - c_k)(X_{jk} + c_k - 2c_k)$    by (9.55)

    $= \frac{a-1}{a}(X_{jk} - c_k)^2$

This amount should be subtracted for each missing value from the sum of squares due to varieties in order to correct for bias.

In Example 9, $c_1 = 31.0/4 = 7.75$, $c_5 = 39.1/4 = 9.78$, so that the bias is

    $\frac{4}{5}[(7.00 - 7.75)^2 + (9.23 - 9.78)^2] = 0.692$

The sum of squares between varieties, using the best estimates of $x$ and $y$, is 186.73, so that the correction here makes very little difference. The analysis of variance is

Variation        SS        df       MS
Blocks          34.40       4       8.60
Varieties      186.04       4      46.51
Residual        39.10      14       2.79
Total          259.54      22

The 5% point for $F$ with $n_1 = 4$, $n_2 = 14$ is 3.11 and the 1% point is 5.03. There is a barely significant effect between blocks but a highly significant one between varieties.
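
The bias correction is a one-line computation once the estimates and the block totals are known. The following sketch simply retraces the arithmetic of Example 9 with the figures as printed; the variable names are illustrative, and because $c_5$ is carried unrounded here the result differs from 0.692 in the third decimal.

```python
a = 5                         # number of varieties
estimates = [7.00, 9.23]      # printed estimates of the missing yields x and y
block_totals = [31.0, 39.1]   # B_k' for blocks 1 and 5, missing value excluded

# (9.55): the value minimizing the conditional error in each affected block.
c = [B / (a - 1) for B in block_totals]

# Bias in the varieties sum of squares: (a - 1)/a * (X - c)^2 per missing value.
bias = (a - 1) / a * sum((X - ck) ** 2 for X, ck in zip(estimates, c))
corrected_ss = 186.73 - bias

print(round(bias, 3), round(corrected_ss, 2))
```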

9.15 Analysis of Covariance. Suppose that in each family or group of a one-way classification we have a number of pairs of variates $x_{jk}$, $y_{jk}$, and that we are interested in the effect of the classification on the relationship between $x$ and $y$, as expressed by their covariance. This leads to the analysis of covariance.

We may wish to test the hypothesis that the linear regression of $y$ on $x$ is the same for the different families. According to this hypothesis, a linear regression exists in the parent population, expressed by

(9.56)    $\eta - \mu_y = \beta(x - \mu_x)$

The true regression coefficient $\beta$ may be estimated by the sample regression as a whole or by the regression of class means or by the various regressions within classes. There is no simple test for the significance of differences or ratios of these observed regressions. We can, however, find estimates of the variance of $y$ after removing the effects of regression, and use these estimates to test for the presence of class effects.

Analogously to (9.4) we have for covariance

(9.57)    $\sum_{jk} (x_{jk} - \bar{x})(y_{jk} - \bar{y}) = \sum_{jk} (x_{jk} - \bar{x}_{.k})(y_{jk} - \bar{y}_{.k}) + \sum_k n_k(\bar{x}_{.k} - \bar{x})(\bar{y}_{.k} - \bar{y})$

where $n_k$ is the number of pairs in the $k$th class.

It is convenient to adopt a notation for the sums of squares and products, due to E. S. Pearson. Let $C_{12k} = \sum_j (x_{jk} - \bar{x}_{.k})(y_{jk} - \bar{y}_{.k})$, which is the sum of products in the $k$th family, and so proportional to the covariance. The corresponding sums of squares for $x$ and $y$ may be denoted by $C_{11k}$ and $C_{22k}$, and the regression coefficient of $y$ on $x$ by $b_k = C_{12k}/C_{11k}$. A combined regression coefficient within families is given by $b_w = \sum_k C_{12k} / \sum_k C_{11k} = C_{12w}/C_{11w}$, say. The sum of products between families is given by $C_{12f} = \sum_k n_k(\bar{x}_{.k} - \bar{x})(\bar{y}_{.k} - \bar{y})$, with corresponding notations for the sums of squares. The regression coefficient between families is $b_f = C_{12f}/C_{11f}$. Finally, the total sum of products is $C_{120} = \sum_{jk} (x_{jk} - \bar{x})(y_{jk} - \bar{y})$ and the corresponding regression coefficient is $b_0 = C_{120}/C_{110}$. With this notation, (9.57) may be written

(9.58)    $C_{120} = C_{12f} + \sum_k C_{12k} = C_{12f} + C_{12w}$

with similar relations for $C_{110}$ and $C_{220}$.

We assume that, apart from the effects of regression, $y_{jk}$ is normally distributed, that is,

(9.59)    $y_{jk} = \eta + \epsilon_{jk}$

where $\eta$ is given by (9.56) and the $\epsilon_{jk}$ are independent normal variates with mean 0 and variance $\sigma^2$. As in Chapter VIII, we may calculate the sum of squares of residuals after allowing for regression and hence obtain an estimate of $\sigma^2$. For the experiment as a whole, this sum is

(9.60)    $S_0 = \sum_{jk} \{y_{jk} - \bar{y} - b_0(x_{jk} - \bar{x})\}^2 = C_{220} - b_0 C_{120}$

and $S_0/(N - 2)$ is an unbiased estimate of $\sigma^2$. Again, we may calculate the sum of squares of residuals for the family means measured from the regression of the means. This sum is

(9.61)    $S_1 = \sum_k n_k \{\bar{y}_{.k} - \bar{y} - b_f(\bar{x}_{.k} - \bar{x})\}^2 = C_{22f} - b_f C_{12f}$

and $S_1/(b - 2)$ is another estimate of $\sigma^2$. By subtraction, using (9.58), or by direct computation, we can find $C_{11w}$, $C_{22w}$ and $C_{12w}$, and hence calculate

(9.62)    $S_2 = C_{22w} - b_w C_{12w}$

This gives a sum of squares of residuals within families from regression lines with a common slope $b_w$ and passing through the respective family means. It is made up of two parts, one ($S_3$) consisting of the sum of squares of deviations from individual regressions within families, and the other ($S_4$) consisting of the squares of differences between the family regressions and the combined regression. By definition,

(9.63)    $S_3 = \sum_k (C_{22k} - b_k C_{12k})$

Therefore

          $S_4 = S_2 - S_3 = \sum_k b_k C_{12k} - b_w C_{12w}$

              $= \sum_k b_k^2 C_{11k} - b_w^2 C_{11w}$

              $= \sum_k (b_k - b_w)^2 C_{11k}$

(9.64)        $= \sum_{jk} (b_k - b_w)^2 (x_{jk} - \bar{x}_{.k})^2$

The number of degrees of freedom for $S_3$ is $\sum_k (n_k - 2) = N - 2b$. The number for $S_4$ is $b - 1$, so that the number for $S_2$ is $N - b - 1$.

In order to test whether there is any class-effect we may compare the estimate of $\sigma^2$ derived from $S_0 - S_3$ (between families) with that derived from $S_3$ (within families). We put

(9.65)    $F = \dfrac{(S_0 - S_3)(N - 2b)}{S_3(2b - 2)}$

with degrees of freedom $2b - 2$ and $N - 2b$. In doing this we are in effect adjusting for the individual regressions. We may prefer, however, to adjust for the combined regression with coefficient $b_w$, as this is more accurately determined than the individual values of $b_k$. If so, we should use $S_2$ instead of $S_3$, and test by means of

(9.66)    $F = \dfrac{(S_0 - S_2)(N - b - 1)}{S_2(b - 1)}$

with $b - 1$ and $N - b - 1$ df.

The difference in the two methods of adjustment may be more clearly seen in Figure 29. If P is a class mean and M the grand mean, the mean adjusted for individual regression is Q. The adjusted mean $y_k$ is given by

(9.67)    $y_k = \bar{y}_{.k} - b_k(\bar{x}_{.k} - \bar{x})$

If, however, we draw through the class mean a line parallel to the combined regression line (with slope $b_w$), the adjusted mean is the point R, given by

(9.68)    $y_w = \bar{y}_{.k} - b_w(\bar{x}_{.k} - \bar{x})$

[Fig. 29]

The significance of differences among the individual regressions may be tested by the ratio

(9.69)    $F = \dfrac{S_4}{S_3} \cdot \dfrac{N - 2b}{b - 1}$

with degrees of freedom $b - 1$ and $N - 2b$. If this is non-significant, $S_3$ and $S_4$ may be pooled as $S_2$.

Example 10 [Snedecor]. Six groups of rats, 10 to a group, were given different foods and for each rat the food-intake $x$ (in 10-calorie units) and the gain in weight $y$ (in grams) were recorded. The data are given in Table 27.

Table 27. Food-intake (x) and Gain in Weight (y) for 60 Rats

          Group 1      Group 2      Group 3      Group 4      Group 5      Group 6
Rat       x     y      x     y      x     y      x     y      x     y      x     y
1        108    73     99    98    194    94    165    90    124   107    140    49
2        136   102    117    74    198    79    164    76     95    95    177    82
3        138   118     90    56    196    96    161    90    116    97    189    73
4        159   104    141   111    198    98    159    64    112    80    142    86
5        146    81    106    95    210   102    175    86    123    98    216    81
6        141   107    112    88    196   102    135    51    110    74    200    97
7        175   100    110    82    230   108    132    72    137    74    255   106
8        149    87    117    77    222    91    190    90    105    67    173    70
9        174   117    111    86    220   120    145    95    135    89    153    61
10       176   111    122    92    228   105    142    78    126    58    160    82
Total   1502  1000   1125   859   2092   995   1568   792   1183   839   1805   787



In this example each $n_k$ is equal to 10, so that $N = 60$ and $b = 6$.

The sums of squares and products for the separate groups are given in Table 27A.

Table 27A. Sums of Squares, Regression Coefficients and Adjusted Means, for Data of Table 27

k         C_11k       C_22k       C_12k      b_k         S_3k      ybar_.k     y_k       y_w
1        4159.6                  1646.0     0.3957      1410.7     100.0                101.6
2        1682.5                  1138.5     0.6767      1260.5      85.9      114.3
3        1897.6                   738.0     0.3889       785.5      99.5       78.3
4        3023.6      1735.6      1184.4     0.3917      1271.6      79.2       78.3      78.4
5        1576.1      2220.9       -58.7    -0.0372      2218.7      83.9       82.5      96.7
6       11630.5      2464.1      3818.5     0.3283      1210.4      78.7       70.2      69.6
Total   23969.9     11586.0      8466.7    (0.3532)     8157.4

The last three columns give the actual mean gains and the mean gains adjusted for individual regressions and for the combined regression. The last row gives the values of $C_{11w}$, $C_{22w}$, $C_{12w}$, $b_w$ and $S_3$. By (9.62), $S_2 = 8595.6$, so that $S_4 = 438.2$. From the whole data of Table 27 we find $C_{110} = 91632.6$, $C_{220} = 16198.9$, $C_{120} = 13987.7$, $b_0 = 0.1526$, $S_0 = 14063.7$. From the column totals of Table 27, we get $C_{11f} = 67662.7$, $C_{22f} = 4612.9$, $C_{12f} = 5521.0$, $b_f = 0.0816$, $S_1 = 4162.4$. Note that $C_{110} = C_{11f} + C_{11w}$, with two similar equations, thus checking the arithmetic.

We have, therefore, the following analysis of covariance (Table 27B).
We have, therefore, the following analysis of covariance (Table 27B). 
Table 27B. Analysis of Covariance for Data of Table 27

Variation                                              Residual SS      df        MS
Within groups (S_3)                                       8157.4        48       169.9
Between regressions (S_4)                                  438.2         5        87.6
Within groups, from combined regression (S_2)             8595.6        53       162.2
Between groups, from regression of means (S_1)            4162.4         4      1040.6
Difference between b_w and b_f (S_5)                      1305.7         1      1305.7
S_0 - S_2 = S_1 + S_5                                     5468.1         5      1093.6
S_0 - S_3 = S_4 + S_5 + S_1                               5906.3        10       590.6
Total (S_0)                                              14063.7        58

The value of $F$ given by (9.65) is 3.48, with 5% and 1% points at 2.03 and 2.71, so that there is clearly a significant difference between the adjusted means $y_k$. That is, the different groups of rats show a very significant difference in gain in weight, even when adjusted to a common food-intake. The value of $F$ given by (9.66) is 6.74, with 5% and 1% points at 2.39 and 3.39, so that there is even greater difference between the adjusted means $y_w$. From (9.69), $F$ is less than 1, so that there is obviously no significant difference between the regressions. This justifies the procedure of (9.66) in which the regressions are combined.

The regression coefficient for the group means ($b_f$) is small. The residual mean square for group means ($S_1$) is highly significant as compared with that for the average regression within groups ($S_2$), which demonstrates that the group means are very erratic. The quantity $S_5$ in Table 27B, with a single degree of freedom, represents the difference between the regression of group means ($b_f$) and the average regression within groups ($b_w$). If $S_1$ had been non-significant, a significant $S_5$ would have indicated a different trend for the group means from that for the individuals within a group. Here, of course, the group means show so little trend that this interpretation has no real meaning.
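
The whole of Example 10 can be retraced from the raw data of Table 27. The sketch below is an illustration, not the book's computing scheme: the helper `sums` and the variable names are assumptions, and the residual sums of squares are written as $C_{22} - C_{12}^2/C_{11}$, which is the same as $C_{22} - bC_{12}$ with $b = C_{12}/C_{11}$. Values computed this way may differ from the printed ones in the last digit because the book rounds intermediate results.

```python
# (x, y) observations for the six groups of Table 27.
groups = [
    ([108, 136, 138, 159, 146, 141, 175, 149, 174, 176],
     [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]),
    ([99, 117, 90, 141, 106, 112, 110, 117, 111, 122],
     [98, 74, 56, 111, 95, 88, 82, 77, 86, 92]),
    ([194, 198, 196, 198, 210, 196, 230, 222, 220, 228],
     [94, 79, 96, 98, 102, 102, 108, 91, 120, 105]),
    ([165, 164, 161, 159, 175, 135, 132, 190, 145, 142],
     [90, 76, 90, 64, 86, 51, 72, 90, 95, 78]),
    ([124, 95, 116, 112, 123, 110, 137, 105, 135, 126],
     [107, 95, 97, 80, 98, 74, 74, 67, 89, 58]),
    ([140, 177, 189, 142, 216, 200, 255, 173, 153, 160],
     [49, 82, 73, 86, 81, 97, 106, 70, 61, 82]),
]


def sums(xs, ys):
    """Return (C11, C22, C12): sums of squares and products about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    c11 = sum((x - mx) ** 2 for x in xs)
    c22 = sum((y - my) ** 2 for y in ys)
    c12 = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return c11, c22, c12


b = len(groups)
N = sum(len(xs) for xs, _ in groups)

# Totals (suffix 0) from the pooled data, within-family sums (suffix w) by addition.
all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
C110, C220, C120 = sums(all_x, all_y)
per_group = [sums(xs, ys) for xs, ys in groups]
C11w = sum(c[0] for c in per_group)
C22w = sum(c[1] for c in per_group)
C12w = sum(c[2] for c in per_group)

S0 = C220 - C120 ** 2 / C110                              # (9.60)
S2 = C22w - C12w ** 2 / C11w                              # (9.62)
S3 = sum(c22 - c12 ** 2 / c11 for c11, c22, c12 in per_group)
S4 = S2 - S3                                              # (9.64)

F_individual = (S0 - S3) * (N - 2 * b) / (S3 * (2 * b - 2))   # (9.65)
F_combined = (S0 - S2) * (N - b - 1) / (S2 * (b - 1))         # (9.66)
F_slopes = (S4 / (b - 1)) / (S3 / (N - 2 * b))                # (9.69)

print(round(S0, 1), round(S2, 1), round(S3, 1), round(S4, 1))
print(round(F_individual, 2), round(F_combined, 2), round(F_slopes, 2))
```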

9.16 Experimental Design. It is apparent from the examples in this chapter that the analysis of variance and covariance is a powerful tool for extracting as much information as possible from the results of experiment. It is important, of course, that the experiment should be properly designed in the first place, and a great deal of work has been done in developing efficient experimental designs and procedures. The reader may refer for details to R. A. Fisher's The Design of Experiments (5th Edition, 1949) or to Cochran and Cox, Experimental Designs, 1950. Some very interesting applications of group theory to problems of design have been made by R. C. Bose, a leading member of the Indian school of statisticians. Here space does not permit of more than a brief reference to some of the commoner designs.

The simplest design is that of complete randomization. In a field experiment it would mean that the various replications of a certain variety or treatment are scattered over the whole area [Figure 30(a)]. If there is no particular reason for grouping, this arrangement is satisfactory, as it permits the maximum number of degrees of freedom for error.

If a complete set of treatments are grouped together in a block we have the randomized block design, illustrated in Figure 30(b).

Fig. 30. (a) Complete Randomization, (b) Randomized Blocks, (c) Latin Squares, (d) Graeco-Latin Squares, (e) Split Plot Randomized Blocks

This is illustrated in Example 2, at the end of § 9.2, where the blocks consist of types of hog; the animals in one block may be matched in age or weight or may be from the same litter. In a field experiment the blocks may be groups of plots chosen so as to be comparable in respect of soil fertility, etc. If the blocks as a whole are identified with some source of variability, this variability is allowed for and does not affect the experimental error. The experiment is thus more precise than a purely random arrangement.

The Latin Square [Figure 30(c)] is a device for controlling two sources of 
error at once. In field work the treatments are so allocated among the plots 
that no treatment occurs more than once in any one row or any one column 
of the Latin Square. Variability among rows and among columns is removed 
from the error. This serves to control variability due to gradients of soil 
fertility in two directions at right angles across the field. In an animal- 
feeding experiment the treatments might be rations, the rows litters and the 
columns weights. If, for example, we had four types of ration under investi- 
gation, we should use 16 animals, 4 from each of 4 litters, and group them 
approximately in 4 weight classes so that each class included one animal from 
each litter. The treatments would then be allocated among the animals 
according to a Latin Square, and variability between litters and between 
weights would be removed from error. 

In a Latin Square with $a$ rows and columns, let $R_j$ denote the sum of the $j$th row, $C_k$ the sum of the $k$th column, and $T_i$ the sum of the $i$th treatment, for a variable $x$ ($i, j, k = 1, 2, \ldots, a$). Let $G$ be the grand total. The analysis of variance is

Variability     SS                                  df                MS
Rows            $S_1 = \sum_j R_j^2/a - G^2/a^2$     a - 1             $S_1/(a - 1)$
Columns         $S_2 = \sum_k C_k^2/a - G^2/a^2$     a - 1             $S_2/(a - 1)$
Treatments      $S_3 = \sum_i T_i^2/a - G^2/a^2$     a - 1             $S_3/(a - 1)$
Error           $S_0 - S_1 - S_2 - S_3$              (a - 1)(a - 2)    $(S_0 - S_1 - S_2 - S_3)/(a - 1)(a - 2)$
Total           $S_0 = \sum x^2 - G^2/a^2$           a^2 - 1


For a large number of treatments it is often difficult to arrange for the right 
number of rows and columns, and for a small number there are not many 
degrees of freedom left for error. The Latin Square is ideal for 5 to 8 treat- 
ments or varieties. 
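
The Latin Square analysis of variance tabulated above reduces to a few sums. The sketch below is a minimal illustration: the function name, the cyclic 4 × 4 layout and the yields are all hypothetical, the yields being built from purely additive row, column and treatment effects so that the error sum of squares should vanish.

```python
def latin_square_anova(layout, x):
    """Sums of squares for an a x a Latin Square.

    layout[j][k] is the treatment label in row j, column k;
    x[j][k] is the corresponding observation.
    """
    a = len(x)
    G = sum(map(sum, x))
    correction = G * G / a ** 2
    R = [sum(row) for row in x]
    C = [sum(col) for col in zip(*x)]
    T = {}
    for j in range(a):
        for k in range(a):
            T[layout[j][k]] = T.get(layout[j][k], 0.0) + x[j][k]
    S0 = sum(v * v for row in x for v in row) - correction
    S1 = sum(r * r for r in R) / a - correction
    S2 = sum(c * c for c in C) / a - correction
    S3 = sum(t * t for t in T.values()) / a - correction
    return {"rows": S1, "columns": S2, "treatments": S3,
            "error": S0 - S1 - S2 - S3, "total": S0,
            "df_error": (a - 1) * (a - 2)}


# Hypothetical cyclic square and additive yields (no error component).
letters = "ABCD"
layout = [[letters[(j + k) % 4] for k in range(4)] for j in range(4)]
row_eff, col_eff = [0, 1, 2, 3], [0, 2, 4, 6]
trt_eff = {"A": 0, "B": 5, "C": 1, "D": 3}
x = [[10.0 + row_eff[j] + col_eff[k] + trt_eff[layout[j][k]] for k in range(4)]
     for j in range(4)]

table = latin_square_anova(layout, x)
print(table)
```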

The number of possible Latin Squares for a given $a$ is very large for the larger values of $a$, even when restricted to the standard kinds, from which others may be obtained by permutation of rows and columns. A standard square has the letters in their natural order in both the first row and the first column, and $a!(a - 1)!$ squares may be obtained from it by permutation. There are 4 standard squares for $a = 4$, 56 for $a = 5$ and 9408 for $a = 6$, and the total numbers corresponding are 576, 161280 and 812851200 respectively. Examples of squares up to 12 × 12 are given in Fisher and Yates' Statistical Tables.
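
The totals quoted follow directly from the factor $a!(a - 1)!$. A short check, using only the numbers of standard squares stated in the text:

```python
from math import factorial

# Number of standard squares for each order a, as quoted in the text.
standard = {4: 4, 5: 56, 6: 9408}

# Each standard square yields a!(a - 1)! squares by permuting rows and columns.
totals = {a: s * factorial(a) * factorial(a - 1) for a, s in standard.items()}
print(totals)
```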

If two Latin Squares are superimposed so that each letter of one square occurs once and only once with each letter of the other square (the two squares being then said to be orthogonal), we get a Graeco-Latin Square. The arrangement is illustrated in Figure 30(d), where the Greek letters are associated with the Latin ones in the way described. This is a rather specialized and uncommon design, necessitating the picking out of three factors which are real sources of variability. In an animal-feeding experiment the third factor might be the pens. If the pens were lettered, in the example mentioned above, from $\alpha$ to $\delta$ and the animals allocated to them according to the Greek letters, the effect of differences among pens would be removed from error. The degrees of freedom for error in an $a \times a$ Graeco-Latin square are $(a^2 - 1) - 4(a - 1) = (a - 1)(a - 3)$, so that the number is rather small unless $a > 4$. No Graeco-Latin square is possible for $a = 6$.

9.17 Split Plots. Confounding. It is often desired to apply the same treat-
ment at different levels. Thus a fertilizer may be applied to experimental 
plots in several different amounts. A simple example is a Split Plot design 
[Figure 30(e)] in which each plot in a randomized block design is split into 
two sub-plots for testing some treatment at 2 levels. The allocation of the 
levels to the sub-plots is random. The purpose of this design is to give 
maximum accuracy to the comparison of levels. 
A good example is given by Goulden. Wheat of two varieties was to be tested for incidence of root-rot. There were 10 different treatments (methods of dusting the seed) and in half the cases the soil was inoculated with a root-rot organism and in the other half not. Two strips of 10 plots each formed a block, one strip being planted with variety A and one with variety B. Each plot was split into two, one half-plot being inoculated and the other half-plot not. There were 4 complete blocks, and hence 160 half-plots altogether. The total number of df was 159, 80 within halves of split plots and 79 between plots. If we ignore for the moment the difference between varieties, there were 8 strips each containing 10 randomized treatments, so that of the 79 df between plots, 7 were between strips, 9 between treatments and 63 belonged to error. But actually, of these 7 df between strips, 1 was between varieties, 3 were between the blocks and 3 belonged to error (the error appropriate to the test of significance between varieties). Also, the 63 df for error contained 9 df attributable to interaction between varieties and treatments, leaving 54 for the error appropriate to a test of significance between treatments.

Finally, of the 80 df within halves of split plots (which naturally contained no direct effects due to blocks, varieties, or treatments), 1 df corresponded to difference between inoculated and uninoculated, 1 to interaction of inoculation with varieties, 9 to interaction with treatments, and 9 to the triple interaction of inoculations with varieties and treatments, leaving 60 for the error appropriate to a test of significance between inoculated and uninoculated soils. Interactions with blocks, and quadruple interactions, have here been included in the error, and the triple interaction might be so included also. The table of degrees of freedom would thus read:


Source of Variation                                      df

Between plots                                                          79
    Between strips                                              7
        Between blocks                                    3
        Between varieties                                 1
        Error (1)                                         3
    Between treatments                                          9
    Remainder                                                  63
        Interaction V × T                                 9
        Error (2)                                        54

Within halves of split plots                                           80
    Between I and U                                       1
    Interaction I × V                                     1
    Interaction I × T                                     9
    Interaction I × T × V                                 9
    Error (3)                                            60

Total                                                                 159


In complete blocks, as in the examples already given, every treatment is
represented, but sometimes this procedure would make the blocks too large
for convenience. It is, in many experiments, advisable to use incomplete
blocks, but this means that it will not be possible to estimate separately all the
interactions. Those effects which are not estimable are said to be confounded.
Thus suppose we wish to test three fertilizers containing nitrogen (N), potassium
(K), and phosphorus (P), each at 2 levels (presence and absence).
Including the control with no fertilizer, we should have 8 treatments, and with
4 replications this would require 32 plots arranged in 4 blocks. There would
be 3 df for main effects, 3 for simple interactions, 1 for a triple interaction,
3 between blocks and 21 for error. The triple interaction N × P × K would
be the most difficult to interpret and probably the least important, and might
well be confounded with block effects. If we choose 8 blocks with 4 plots each,
and in 4 of these blocks put the treatments 0, NP, PK, NK, and in the other 4
the treatments N, P, K, NPK, the sum of squares between blocks will include
also the triple interaction effect. To see this, let $x_{ijk}$ represent the total
yield from all four plots with the ith level of N, the jth level of P and the kth
level of K (i, j, k = 0 or 1). The yield from treatment N, for example, is
represented by $x_{100}$. Since the "sum of squares" for two variates is half
the square of their difference, the sum of squares between the two sets of blocks is

\[ \tfrac{1}{32}\,[x_{000} + x_{110} + x_{011} + x_{101} - x_{100} - x_{010} - x_{001} - x_{111}]^2 \]

there being 16 plots in each set of 4 like blocks. In a notation similar to that of
§ 9.4, $x_{.jk} = \tfrac12(x_{0jk} + x_{1jk})$, etc., and $x_{..k} = \tfrac14(x_{00k} + x_{01k} + x_{10k} + x_{11k})$, etc.

The triple interaction sum of squares is given by

\[ \tfrac14 \sum_{i,j,k} [x_{ijk} - x_{.jk} - x_{i.k} - x_{ij.} + x_{i..} + x_{.j.} + x_{..k} - x]^2 \tag{9.70} \]

where x is the mean of all eight $x_{ijk}$. It is easily verified that for every
value of i, j, k, the bracket in (9.70) reduces to

\[ \pm\tfrac18\,[x_{000} + x_{110} + x_{011} + x_{101} - x_{100} - x_{010} - x_{001} - x_{111}] \]

so that the triple interaction sum of squares is given by exactly the same
combination of yields as the sum of squares between blocks. There will now be 3
df for main effects, 7 between blocks, 3 for simple interactions and 18 for error.
The advantage of having a greater number of degrees of freedom between
blocks is that variation between blocks due to soil heterogeneity is more
adequately estimated, the blocks being smaller and more numerous, and this
will probably lead to a more accurate estimation of the principal effects than
if the design of complete randomized blocks had been chosen.
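
The coincidence of the two sums of squares can also be checked numerically. The sketch below is only an illustration: the eight treatment totals are made-up numbers, and the helper `mean_over` is a hypothetical name introduced here. It evaluates the block contrast with divisor 32 and the triple-interaction expression (9.70) with divisor 4, and the two should agree.

```python
from itertools import product

# Hypothetical treatment totals x[i, j, k] (i = N level, j = P level, k = K level),
# each the total of four plots. The numbers are invented for illustration only.
x = {
    (0, 0, 0): 41.0, (1, 0, 0): 52.5, (0, 1, 0): 47.0, (0, 0, 1): 44.5,
    (1, 1, 0): 58.0, (1, 0, 1): 55.5, (0, 1, 1): 49.0, (1, 1, 1): 63.5,
}

def mean_over(fixed):
    """Mean of x over every index that is None in `fixed` (a 3-tuple)."""
    cells = [c for c in product((0, 1), repeat=3)
             if all(f is None or f == v for f, v in zip(fixed, c))]
    return sum(x[c] for c in cells) / len(cells)

# Sum of squares between the two sets of blocks (16 plots in each set).
even = x[0, 0, 0] + x[1, 1, 0] + x[0, 1, 1] + x[1, 0, 1]   # 0, NP, PK, NK
odd = x[1, 0, 0] + x[0, 1, 0] + x[0, 0, 1] + x[1, 1, 1]    # N, P, K, NPK
ss_blocks = (even - odd) ** 2 / 32

# Triple-interaction sum of squares, following (9.70).
ss_npk = 0.0
for i, j, k in product((0, 1), repeat=3):
    bracket = (x[i, j, k]
               - mean_over((None, j, k)) - mean_over((i, None, k))
               - mean_over((i, j, None))
               + mean_over((i, None, None)) + mean_over((None, j, None))
               + mean_over((None, None, k))
               - mean_over((None, None, None)))
    ss_npk += bracket ** 2
ss_npk /= 4

print(ss_blocks, ss_npk)
```

Any other set of eight totals may be substituted; the agreement does not depend on the particular numbers.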

9.18 Intra-class Correlation. Suppose we have k measurements of a variate
x, one on each of k individuals in a family (k ≥ 2), and we repeat these
measurements on N families. The correlation between members of a family
is called the intra-class correlation and the coefficient is defined by

\[ r = \frac{1}{Nk(k-1)s^2} \sum_i \sum_{a \ne b} (x_{ia} - \bar{x})(x_{ib} - \bar{x}) \tag{9.71} \]

where $\bar{x}$ is the common mean and $s^2$ the common variance, calculated from all
the kN individuals.

By definition, the total sum of squares is

\[ kNs^2 = \sum_i \sum_a (x_{ia} - \bar{x})^2 \]

and the sum of squares between families is

\[
\begin{aligned}
k \sum_i (\bar{x}_i - \bar{x})^2 &= \frac{1}{k} \sum_i \Big[ \sum_a (x_{ia} - \bar{x}) \Big]^2 \\
&= \frac{1}{k} \Big[ \sum_i \sum_a (x_{ia} - \bar{x})^2 + \sum_i \sum_{a \ne b} (x_{ia} - \bar{x})(x_{ib} - \bar{x}) \Big] \\
&= Ns^2 + Ns^2(k-1)r \\
&= Ns^2[1 + (k-1)r]
\end{aligned}
\]

Since this sum of squares cannot be negative, the intra-class correlation
coefficient is never less than $-1/(k-1)$.
The sum of squares within families is given by

\[ kNs^2 - Ns^2[1 + (k-1)r] = Ns^2(k-1)(1-r) \]

Fisher has defined a variable z by the relation

\[ z = \tfrac12 \log_e \frac{1 + (k-1)r}{1 - r} \tag{9.72} \]

which is of the same form as the ordinary Fisher z of § 8.15 when k = 2.
This variable z has a distribution which tends to normality for large N,
although not as rapidly as when k = 2. The variance approximates for large
N to $k/[2(k-1)(N-2)]$. If ζ is the population value of z, then for large N

\[ E(z) = \zeta - \tfrac12 \log_e [N/(N-1)] \]

so that z may be corrected for bias by adding $\tfrac12 \log_e [N/(N-1)]$ or
approximately $(2N-1)^{-1}$.

In terms of the analysis of variance,

\[ z = \tfrac12 \log (S_B/S_W) + \tfrac12 \log (k-1) \]

where $S_B$ is the SS between families and $S_W$ the SS within families. Hence
$e^{2z} = (k-1)S_B/S_W$.

On the assumption that there is no true correlation between members of a
family, so that $s^2$ is an estimate from kN random observations of a common
variance $\sigma^2$, the quantity $(k-1)S_B/S_W$ is the ratio of two independent
estimates of $\sigma^2$. In this case z has the distribution of Fisher's z in § 7.14, or
in other words, $[1 + (k-1)r]/(1-r)$ has the Snedecor F-distribution with
N − 1 and N(k − 1) df. When the variance between families is significantly
greater than that within families, the existence of intra-class correlation is
indicated. The test for intra-class correlation may thus be regarded as a
test of homogeneity of variance.


Example 11. In Example 1, § 9.1, omitting the fourth column, we have four "families"
each consisting of four concrete cylinders. The variable is the breaking strength.
The SS between families is 53,830 and the SS within families is 90,844, so that
$e^{2z} = (1 + 3r)/(1 - r) = 1.778$, giving r = 0.163.




The 5% and 1% points for 3 and 12 df are 3.49 and 5.95, so that there is no reason to 
doubt the homogeneity of the variance. In other words, the intra-class correlation is not 
significantly different from zero. 

The value of z is 0.288 and the bias correction is $\tfrac12 \log_e[N/(N-1)] = 0.144$. The
correction is, therefore, very considerable for an N as low as 4, and in this case raises z to
0.432 or F to 2.37. Even this value is non-significant, however.
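
The arithmetic of Example 11 can be retraced in a few lines. The following is a minimal sketch, assuming only the relations given in this section; the function name `intraclass_summary` and its argument names are illustrative and not taken from the text.

```python
import math

def intraclass_summary(ss_between, ss_within, n_families, k):
    """Return r, z, the bias correction, corrected z and F from the two SS."""
    e2z = (k - 1) * ss_between / ss_within       # e^{2z} = (k-1) S_B / S_W
    r = (e2z - 1) / (e2z + k - 1)                # solves (1 + (k-1)r)/(1 - r) = e2z
    z = 0.5 * math.log(e2z)
    bias = 0.5 * math.log(n_families / (n_families - 1))
    z_corrected = z + bias
    f_ratio = math.exp(2 * z_corrected)          # ratio on N-1 and N(k-1) df
    return r, z, bias, z_corrected, f_ratio

# Figures of Example 11: four families of four cylinders each.
r, z, bias, z_corrected, f_ratio = intraclass_summary(53830, 90844, 4, 4)
print(round(r, 3), round(z, 3), round(bias, 3),
      round(z_corrected, 3), round(f_ratio, 2))
```

The corrected value of F is simply $e^{2z}$ multiplied by N/(N − 1), which is the ordinary ratio of the mean square between families to the mean square within families.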

Problems 

1. (Mills' text, revised.) Manufacturing industries were classified into those producing
perishable, semi-durable, and durable goods. An average of changes occurring between
1929 and 1933 in the selling prices of the products of each of these categories was computed
giving the index numbers shown in the $\bar{y}_i$ column of the following table.

Class of industry                 Number of           Means,     Computations
                                  industries, $N_i$   $\bar{y}_i$
Producing perishable goods             34              69.81     b − 1 = 2, N − b = 82
Producing semi-durable goods           26              66.41     $q_1$ = 2,161.8800
Producing durable goods                25              78.96     $q_2$ = 15,564.9040
All industries                         85                        Q = 17,726.7840


Compute F and test the null hypothesis that there was no real difference in the price move- 
ments of the three different classes of industry for the years 1929-1933. 

2. Prove that

\[ \sum_{i=1}^{2} N_i(\bar{x}_i - \bar{x})^2 = \frac{N_1 N_2}{N_1 + N_2}\,(\bar{x}_1 - \bar{x}_2)^2 \]


3. Show that the test for significance between two means is a special case of the test for
variation between means of families as given in § 9.1.

Hint. When b = 2, $q_1$ reduces to the expression given in Problem 2. Also $q_2$ becomes
$N_1 s_1^2 + N_2 s_2^2$. Hence

\[ F = \frac{q_1/(b-1)}{q_2/(N-b)} \]

becomes

\[ \frac{N_1 N_2}{N_1 + N_2} \cdot \frac{N_1 + N_2 - 2}{N_1 s_1^2 + N_2 s_2^2}\,(\bar{x}_1 - \bar{x}_2)^2 \]

and the square root of this is the t of (7.61).

4. The data represent sugar yield (tons per acre) for 9 varieties of sugar beet, grown each 
on 5 plots. Assuming that the design consists of 5 blocks each of 9 randomized plots, ana- 
lyze the variance, and test for a significant difference between varieties. 


Block                                Variety
          A      B      C      D      E      F      G      H      J
  1     1.94   1.70   2.23   2.14   1.80   1.82   1.91   1.90   1.98
  2     2.08   1.96   2.26   2.08   2.23    …      …     2.25   2.03
  3     1.86   1.83*  2.22   2.16   1.67    …     2.22   1.92   1.81
  4     2.21   1.60   2.08   2.16   2.11   1.96   2.14   1.99   1.77
  5     2.03   2.13   2.02   2.17   2.01   2.28   2.28   2.02   1.88

5. A 5 × 5 Latin Square experiment gave the following yields for 5 varieties A to E,
planted each in 5 plots in the arrangement indicated.


   B       D       E       A       C
  5.8     6.4     3.3     9.5    11.8

   C       A       B       E       D
  9.3     4.0     6.2     5.1     5.4

   D       C       A       B       E
  7.6    15.4     6.5     6.0     4.6

   E       B       C       D       A
  6.3     7.6    13.2     8.6     4.9

   A       E       D       C       B
  9.3     6.3    11.8    15.9     7.6


Construct a table showing sums of squares, degrees of freedom and mean squares, due to 
rows, columns, varieties and error, and test for significance between varieties. 

6. Work out on the lines of § 9.14 the technique of allowing for a single missing plot in a 
Latin Square design. 

Hint The missing value x occurs in a row sum, a column sum, and a treatment sum. 
The conditional SS assumes that the treatment effect is zero. 

7. Suppose that in the Latin Square experiment of Problem 5, the yield of variety B in
row 1 and column 1 had been missing. Estimate this yield. Obtain also the best estimate
of the yield of A minus the yield of B, and test whether the difference of these yields is
significant.

8. The following data represent yields of millet in 25 plots. Five different spacings
of the plants were used, namely 2", 4", 6", 8" and 10", and plots with these spacings were
arranged in a Latin Square. In the diagram A, B, C, D, E represent the five spacings in
order, and the yields are given under the respective letters.

   B       E       A       C       D
  257     230     279     287     202

   D       A       E       B       C
  245     283     245     280     260

   E       B       C       D       A
  182     252     280     246     250

   A       C       D       E       B
  203     204     227     193     259

   C       D       B       A       E
  231     271     266     334     338


Test for variations between spacings. Compute the correlation coefficient between mean 
yield and spacing, and test this coefficient for significance. 

9. Treat the data of Problem 8 as a problem in covariance. That is, assume that we 
have a 2-way classification (rows and columns) with two variables x and y in each cell, x 
being the spacing and y the yield. Find the mean yields in rows after adjusting for
regression on spacing, and test whether there is any significant difference.

10. In a greenhouse experiment on wheat, 4 fertilizer treatments of the soil and 4 chemical
treatments of the seed were used (including in each case a control with no treatment) . Each 
combination of treatments was applied to 3 plots, which were placed at random in the avail- 
able space. The yields are given in the table: 





Fertilizer                              Chemical Treatment
                   I                   II                  III                 IV
    I       21.4, 21.2, 20.1    20.9, 20.3, 19.8    19.6, 18.8, 16.6    17.6, 16.6, 17.5
    II      12.0, 14.2, 12.1    13.6, 13.3, 11.6    13.0, 13.7, 12.0    13.3, 14.0, 13.9
    III     13.0, 11.9, 13.4    14.0, 15.6, 13.8    12.7, 12.9, 13.1    12.4, 13.7, 13.0
    IV      12.8, 13.8, 13.7    14.1, 13.2, 15.3    14.2, 13.6, 13.3    12.0, 14.6, 14.0


Show that there is a highly significant interaction between chemical treatments and ferti- 
lizers. 

Hint The error SS is that within the sub-classes. 

11. Suppose that the data represent yields in an experiment with two treatments N and
P each at two levels. Only two treatments are applied in each block, as indicated.



Show that there is a partial confounding. In the first pair of blocks the N effect is
confounded with block difference, in the second pair the P effect is confounded and in the
third pair the interaction. Estimate the main effects and the interaction.

Hint. In calculating the sums of squares to estimate any effect, use only the blocks in
which that effect is not confounded. Of the total 11 df, 5 are between blocks, 2 for main
effects, 1 for interaction, and 3 for error. The SS for the N effect is
$\tfrac18 \times (65 + 25 - 55 - 30)^2 = 3\tfrac18$
with 1 df, and similarly for P and N × P.

12. The following data (slightly simplified) represent yields in bushels per acre of 4
varieties of flax grown in 3 randomized blocks at 2 distinct locations for 2 years. Carry out a
complete analysis of variance, separating the main effects and the interactions between
varieties, blocks, locations and years. Note that since the blocks are numbered arbitrarily
there is no connection between, say, block 1 at location G and block 1 at location H. No
main effect for blocks is to be expected, and all the interaction terms involving both blocks
and varieties may be pooled to give the estimate of error.


Year 1948, Location G                      Year 1949, Location G



Year 1948, Location H                      Year 1949, Location H

Blocks         Varieties                   Blocks         Varieties
           A     B     C     D                        A     B     C     D
  1       16    14    19    20               1       13    14    13    14
  2       15    18    21    24               2       12    11    17    15
  3       17    19    23    22               3       15    13    14    17


13. Assume that in the data of Problem 12 the blocks now represent distinct treatments,
the same for each year and each location. Carry out the complete analysis of variance,
separating out all first order, second order and third order interactions. (The third order
interaction mean square is now the only valid estimate of error for testing the other
interactions.)

CHAPTER X


MATRIX ALGEBRA AND THE METHOD OF LEAST SQUARES

10.1 Introduction. Problems of regression and correlation in several vari- 
ates require the solution of a set of simultaneous linear equations. The 
numerical solution of such a set of equations is greatly facilitated by the use of 
an efficient technique. Moreover, the computation of the standard errors of
the variables concerned requires the solution of other related sets of simul- 
taneous equations, or alternatively the calculation of the inverse of a matrix. 
It is the main purpose of this chapter to give an outline of matrix algebra, as a 
useful tool in many statistical problems, and of certain computational methods 
which are convenient in numerical work. 

10.2 Normal Equations. Let us suppose that we have a set of N observations
on each of p + 1 variates, $x_1, x_2, \ldots, x_p$ and y (N > p), and that we wish
to find the "best" linear predicting equation (in the least squares sense),

\[ Y = b_1 x_1 + b_2 x_2 + \cdots + b_p x_p \tag{10.1} \]

for obtaining estimates of y to be associated with future values of $x_1, x_2, \ldots, x_p$.

For example, we may have observations of longitude ($x_1$), latitude ($x_2$),
altitude ($x_3$), and rainfall (y) at 57 weather stations, and wish to estimate
rainfall at other places. The $x_j$ are the predictors, and y is the dependent
variable or predictand. Equation (10.1) is called a multiple regression equation,
and the coefficients $b_1, b_2, \ldots, b_p$ are the partial regression coefficients of y on
$x_1, x_2, \ldots, x_p$ respectively. They are, of course, estimates of the true
regression coefficients $\beta_1, \beta_2, \ldots, \beta_p$, and as such are subject to sampling errors.

The whole set of observations may be represented as in Table 28, where
$x_{j\alpha}$ is the αth observation on the jth variable (α = 1, 2, …, N; j = 1, 2, …, p).

Table 28

\[
\begin{array}{cccccc}
x_{11} & x_{12} & \cdots & x_{1\alpha} & \cdots & x_{1N} \\
x_{21} & x_{22} & \cdots & x_{2\alpha} & \cdots & x_{2N} \\
\vdots &        &        & \vdots      &        & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{j\alpha} & \cdots & x_{jN} \\
\vdots &        &        & \vdots      &        & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{p\alpha} & \cdots & x_{pN} \\
y_1    & y_2    & \cdots & y_\alpha    & \cdots & y_N
\end{array}
\]

The true regression equation may be written

\[ \eta_\alpha = \sum_j \beta_j x_{j\alpha} \tag{10.2} \]

It is assumed that the $x_j$ are either fixed numbers or variates with errors
which are negligible in comparison with the error in y. If $\Delta_\alpha = y_\alpha - \eta_\alpha$ is the
difference between the observed and theoretical values of y for the αth
observation, we suppose that the $\Delta_\alpha$ are independently distributed about zero with a
common variance $\sigma^2$, and that the $b_j$ are chosen so as to make the sum of
squares of the $\Delta_\alpha$ a minimum. We shall use the symbol S to denote summation
with respect to α and the symbol Σ to denote summation with respect
to j (or other Latin subscript).
Imposing the condition

\[ S\Delta_\alpha^2 = S\Big(y_\alpha - \sum_j \beta_j x_{j\alpha}\Big)^2 = \min \tag{10.3} \]

differentiating (10.3) with respect to $\beta_1, \beta_2, \ldots, \beta_p$, and equating the
derivatives to zero with $\beta_j = b_j$, we have, as the least squares estimates of the
$\beta_j$, the values $b_j$ given by

\[ S\,x_{k\alpha}\Big(y_\alpha - \sum_j b_j x_{j\alpha}\Big) = 0 \quad\text{or} \]

\[ \sum_j b_j\, S\,x_{j\alpha} x_{k\alpha} = S\,x_{k\alpha} y_\alpha, \qquad k = 1, 2, \ldots, p \tag{10.4} \]

This is a system of p equations in the p unknowns $b_1, b_2, \ldots, b_p$. Written
out in full, they are

\[
\begin{aligned}
b_1 S x_{1\alpha}^2 + b_2 S x_{1\alpha} x_{2\alpha} + \cdots + b_p S x_{1\alpha} x_{p\alpha} &= S x_{1\alpha} y_\alpha \\
b_1 S x_{2\alpha} x_{1\alpha} + b_2 S x_{2\alpha}^2 + \cdots + b_p S x_{2\alpha} x_{p\alpha} &= S x_{2\alpha} y_\alpha \\
\cdots \qquad & \\
b_1 S x_{p\alpha} x_{1\alpha} + b_2 S x_{p\alpha} x_{2\alpha} + \cdots + b_p S x_{p\alpha}^2 &= S x_{p\alpha} y_\alpha
\end{aligned}
\tag{10.5}
\]

The system is called the normal equations of the problem. It is clear that
the coefficient of $b_j$ in the kth equation is the same as that of $b_k$ in the jth
equation. This symmetry in the coefficients is characteristic of normal
equations, but any system of p equations in p unknowns may be put into the
symmetrical form by a preliminary transformation. (See § 10.11.)
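
As an illustration of how the normal equations are assembled, the sketch below forms the sums of squares and products from a small set of observations and solves the resulting system exactly. The data are invented, and the function names `normal_equations` and `solve` are hypothetical; the elimination used here is a plain Gauss-Jordan reduction chosen for brevity, not one of the compact schemes described later in the chapter.

```python
from fractions import Fraction

def normal_equations(x_rows, y):
    """Return (A, g) with A[j][k] = S x_j x_k and g[k] = S x_k y.

    x_rows[j] holds the N observations on the jth predictor.
    """
    p = len(x_rows)
    A = [[sum(a * b for a, b in zip(x_rows[j], x_rows[k])) for k in range(p)]
         for j in range(p)]
    g = [sum(a * b for a, b in zip(x_rows[k], y)) for k in range(p)]
    return A, g

def solve(A, g):
    """Solve A b = g by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(g)
    M = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(A, g)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [row[n] for row in M]

# Invented observations: two predictors, five observations each.
x_rows = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 6]]
y = [9, 6, 18, 18, 28]

A, g = normal_equations(x_rows, y)
b = solve(A, g)
print(A, g, b)

# The residuals y - sum_j b_j x_j are orthogonal to every predictor,
# which is exactly what the normal equations assert.
residuals = [yv - sum(bj * xr[a] for bj, xr in zip(b, x_rows))
             for a, yv in enumerate(y)]
print([sum(xv * rv for xv, rv in zip(xr, residuals)) for xr in x_rows])
```

The symmetry of the coefficient matrix is visible in the printed `A`: the entry for $b_2$ in the first equation equals the entry for $b_1$ in the second.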

It is of interest to note that the same set of normal equations is obtained by
considering the regression problem in different ways. Thus, if we assume
normality for the $\Delta_\alpha$, the joint probability density is
$(\sigma\sqrt{2\pi})^{-N} e^{-S\Delta_\alpha^2/2\sigma^2}$,
so that the logarithm of the likelihood function is

\[ L = C - N \log \sigma - S(\Delta_\alpha^2)/2\sigma^2 \tag{10.6} \]

The condition of maximum likelihood is therefore equivalent to that of
minimum $S\Delta_\alpha^2$.

Again, we may consider the b's as linear functions of the observed y's,
chosen in such a way as to be unbiased estimates of the β's with minimum
variance. (The less the variance the greater the precision of the estimate.)
The reason for choosing linear functions is that we want the predicting
equation to be independent of any change of scale in the y's. If we put
$b_j = S\,c_{j\alpha} y_\alpha$ and choose the $c_{j\alpha}$ to satisfy the conditions
$E(b_j) = \beta_j$, $\mathrm{Var}(b_j) = \min$, it may be shown that we arrive at equations (10.5).

10.3 Matrix Algebra. The solution of a system of normal equations like 
(10.5), and various problems related to this, are conveniently and concisely 
expressible in matrix notation. In practice some suitable computational tech- 
nique is required, but the use of matrix algebra facilitates the handling of 
theoretical problems. We now give a brief account of this algebra. 

A system of mn elements arranged in a rectangular array of m rows and n
columns is called a matrix of order m × n or an m × n matrix. When m = n,
the matrix is said to be square. Thus, the set of observations in Table 28
forms a (p + 1) × N matrix, and the set of coefficients of the b's in (10.5) is
a square matrix of order p. A matrix is often denoted by a single letter, as

\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots &        &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
\]

To distinguish the matrix from a determinant it is enclosed in square brackets,
parentheses, or double vertical lines. We shall adopt the first of these
conventions. A convenient short notation is $[a_{jk}]$, m × n.

The elements of a matrix are ordinary numbers, real or complex. In our 
work they will be real. An ordinary number, as opposed to a matrix, is called 
a scalar. In matrix algebra the whole matrix is regarded as a mathematical 
entity, subject to algebraic operations which have, of course, to be defined. 

A matrix is said to be null or zero if and only if all of its elements are zero. 

Two matrices A and B are said to be equal if and only if they have the same
number of rows (m) and columns (n) and if $a_{jk} = b_{jk}$ for all j from 1 to m and
all k from 1 to n.

Addition. If A and B are both m × n matrices, the sum A + B is defined
as the matrix $C = [c_{jk}]$, m × n, for which $c_{jk} = a_{jk} + b_{jk}$. Matrices which do
not have the same number of rows and columns cannot be added, and are said
to be not conformable for addition.

Multiplication by a Scalar. If c is a scalar (an ordinary number) and
$A = [a_{jk}]$, m × n, the product cA is defined as the matrix $P = [p_{jk}]$, m × n,
for which $p_{jk} = c\,a_{jk}$.

Multiplication of Two Matrices. If $A = [a_{ij}]$, m × n, and $B = [b_{jk}]$, n × p,
the matrix product AB is defined as the matrix $C = [c_{ik}]$, m × p, for which

\[ c_{ik} = \sum_j a_{ij} b_{jk} \tag{10.7} \]

That is, the element in the ith row and kth column of C is found by
multiplying together, element by element, the ith row of A and the kth column of B,
and adding the products. It is necessary, therefore, that the number of
columns in A shall be equal to the number of rows in B, if A and B are to be
conformable for multiplication.

The operation of addition is easily proved to be commutative and associative.
That is, if A, B, and C are conformable for addition, A + B = B + A and
(A + B) + C = A + (B + C).

The operation of matrix multiplication is associative but is not, in general,
commutative. That is, if A, B, C are conformable for multiplication,
(AB)C = A(BC), but AB may not be equal to BA. In fact, if A is of order
m × n, B must be of order n × m for both products to exist. AB is then of
order m × m and BA of order n × n. Even if m = n, there is no reason why
$\sum_j a_{ij} b_{jk}$ should be equal to $\sum_j b_{ij} a_{jk}$.

Example 1. 

G i] G -a=[.t a 
G GI G a-n -1] 

It is necessary, therefore, to distinguish between "pre-multiplication" and
"post-multiplication." The product AB is often referred to as "B multiplied
by A on the left" or as "A multiplied by B on the right," and similarly for BA.

The distributive laws hold for matrix addition and multiplication, namely,
A(B + C) = AB + AC and (A + B)C = AC + BC, the matrices, of course,
being conformable.

The product law does not hold. If AB is a zero matrix (of the appropriate
number of rows and columns) it does not follow that either A or B is a zero
matrix.

Example 2.

\[
\begin{bmatrix} 2 & -1 \\ 10 & -5 \end{bmatrix}
\begin{bmatrix} 1 & 3 \\ 2 & 6 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}
\]
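
The two points just made, that AB and BA generally differ and that a product may vanish without either factor vanishing, are easy to check by direct computation. The following sketch uses a hypothetical helper `matmul` implementing rule (10.7); the matrices A and B are chosen here purely for illustration, while P and Q are the factors of Example 2 as they can best be read from the text.

```python
def matmul(A, B):
    """Product of an m x n and an n x p matrix, each given as a list of rows."""
    if len(A[0]) != len(B):
        raise ValueError("matrices are not conformable for multiplication")
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

# Illustrative matrices: the two products differ.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(matmul(A, B))   # [[2, 1], [4, 3]]
print(matmul(B, A))   # [[3, 4], [1, 2]]

# Factors of Example 2: neither is a zero matrix, yet the product is.
P = [[2, -1], [10, -5]]
Q = [[1, 3], [2, 6]]
print(matmul(P, Q))   # [[0, 0], [0, 0]]
```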


10.4 Transposition. If the successive columns of matrix A are written as
successive rows of a new matrix A', then A' is called the transpose of A. The
ith column of A' is the ith row of A, and vice versa. If A is an m × n matrix,
then A' is an n × m matrix. For example,

\[
A = \begin{bmatrix} 3 & 6 & 2 \\ 2 & 1 & 0 \\ 5 & 9 & 7 \\ 1 & 0 & 6 \end{bmatrix},
\qquad
A' = \begin{bmatrix} 3 & 2 & 5 & 1 \\ 6 & 1 & 9 & 0 \\ 2 & 0 & 7 & 6 \end{bmatrix}
\]

A square matrix is said to be symmetric if it is equal to its transpose. That is,
$A = [a_{jk}]$, n × n, is symmetric if and only if $a_{jk} = a_{kj}$ for j, k = 1, 2, …, n.
For example,

\[
\begin{bmatrix} 2 & 3 & -4 \\ 3 & 5 & 1 \\ -4 & 1 & 6 \end{bmatrix}
\]

is symmetric, and so is the matrix of the coefficients of the b's in the normal
equations (10.5). In a symmetric matrix, pairs of equal elements are situated
symmetrically with respect to its principal diagonal (upper left corner to lower
right corner).

If A' = −A, the matrix is said to be skew-symmetric. If so, A must be
square and the elements of its principal diagonal must all be zero. For example,

\[
\begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & 3 \\ -2 & -3 & 0 \end{bmatrix}
\]

is skew-symmetric.

Theorem 10.1 (Reversal rule).

\[ (AB)' = B'A' \]

Proof:

If $c'_{ik}$ is the element in the ith row and kth column of (AB)',

\[ c'_{ik} = c_{ki} = \sum_j a_{kj} b_{ji} \]

But the element in the ith row and jth column of B' is $b_{ji}$ and the element in
the jth row and kth column of A' is $a_{kj}$. Hence the (i, k)th element of B'A' is

\[ \sum_j b_{ji} a_{kj} = c'_{ik} \]

Similarly (ABC)' = C'B'A', etc.
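
A quick numerical check of the reversal rule may be helpful. In this sketch the helpers `transpose` and `matmul` are hypothetical names introduced for the illustration; A is the 4 × 3 matrix of the transposition example above and B is an arbitrary 3 × 2 matrix.

```python
def transpose(A):
    """Rows of the result are the columns of A."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of lists of rows, assuming conformable arguments."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

A = [[3, 6, 2], [2, 1, 0], [5, 9, 7], [1, 0, 6]]   # 4 x 3
B = [[1, 0], [2, -1], [0, 4]]                      # 3 x 2, arbitrary

left = transpose(matmul(A, B))                     # (AB)'
right = matmul(transpose(B), transpose(A))         # B'A'
print(left == right)
```

Note that A'B' is not even defined here, since A' is 3 × 4 and B' is 2 × 3; the reversal of order is forced by conformability as well as by the algebra.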

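A square matrix in which all the elements except those in the principal
diagonal are zero is called a diagonal matrix. Diagonal matrices commute
with each other, if conformable. Thus,

\[
\begin{bmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{33} \end{bmatrix}
\begin{bmatrix} b_{11} & 0 & 0 \\ 0 & b_{22} & 0 \\ 0 & 0 & b_{33} \end{bmatrix}
=
\begin{bmatrix} a_{11}b_{11} & 0 & 0 \\ 0 & a_{22}b_{22} & 0 \\ 0 & 0 & a_{33}b_{33} \end{bmatrix}
=
\begin{bmatrix} b_{11} & 0 & 0 \\ 0 & b_{22} & 0 \\ 0 & 0 & b_{33} \end{bmatrix}
\begin{bmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{33} \end{bmatrix}
\]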


A diagonal matrix with all the elements in the principal diagonal equal is
called a scalar matrix. Multiplication on the left or right by a scalar matrix
is equivalent to multiplication by a scalar. Thus

\[
\begin{bmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda \end{bmatrix}
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34}
\end{bmatrix}
=
\begin{bmatrix}
\lambda a_{11} & \lambda a_{12} & \lambda a_{13} & \lambda a_{14} \\
\lambda a_{21} & \lambda a_{22} & \lambda a_{23} & \lambda a_{24} \\
\lambda a_{31} & \lambda a_{32} & \lambda a_{33} & \lambda a_{34}
\end{bmatrix}
= \lambda A
\]

The Unit Matrix. This is a scalar matrix with λ = 1. It is denoted by I.
For any other matrix A,

\[ AI = IA = A \tag{10.8} \]

provided I has the proper number of rows and columns in each case, so as to
be conformable. I behaves, therefore, like the number 1 in ordinary
multiplication.

10.5 Matrices and Linear Transformations. The matrix

\[ y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \]

of m rows and one column, is called a column vector. Its transpose
$[y_1\; y_2\; \cdots\; y_m]$, of one row and m columns, is called a row vector.
Let $x = [x_k]$, n × 1, and $y = [y_j]$, m × 1, be column vectors and let $A = [a_{jk}]$,
m × n, be a matrix of coefficients. Then the matrix equation

\[ y = Ax \tag{10.9} \]

represents the linear transformation

\[
\begin{aligned}
y_1 &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\
y_2 &= a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \\
&\;\;\vdots \\
y_m &= a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n
\end{aligned}
\tag{10.10}
\]

If A = I, the transformation reduces to $y_1 = x_1, y_2 = x_2, \ldots, y_n = x_n$, which
is the identical transformation.

If another linear transformation is made on the y's, say $z_i = \sum_j b_{ij} y_j$,
i = 1, 2, …, p, the relation between the z's and the x's is expressed by

\[ z_i = \sum_j b_{ij} \sum_k a_{jk} x_k = \sum_k \Big( \sum_j b_{ij} a_{jk} \Big) x_k = \sum_k c_{ik} x_k, \qquad \text{where } c_{ik} = \sum_j b_{ij} a_{jk} \]

In matrix notation, z = By = BAx = Cx, where C = BA. The rule of matrix
multiplication is, therefore, seen to be quite natural in terms of successive
linear transformations.
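
The statement that two successive linear transformations are equivalent to a single one with matrix C = BA can be verified on a small case. The sketch below uses invented matrices and a vector chosen only for illustration; `matvec` and `matmul` are hypothetical helper names.

```python
def matvec(A, x):
    """Apply the linear transformation with matrix A to the column vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix product of lists of rows, assuming conformable arguments."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2, 0], [0, 1, 3]]        # y = Ax maps 3 variables to 2
B = [[2, 1], [1, -1], [0, 4]]     # z = By maps 2 variables to 3
x = [1, -2, 3]

y = matvec(A, x)                  # first transformation
z = matvec(B, y)                  # second transformation
C = matmul(B, A)                  # single equivalent transformation
print(y, z, matvec(C, x))         # [-3, 7] [1, -10, 28] [1, -10, 28]
```

The order of the factors in C = BA is the reverse of the order in which the transformations are applied, because each new transformation pre-multiplies the vector already obtained.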


The quadratic form $\sum_{j,k} a_{jk} x_j x_k$ may be written in matrix notation as x'Ax,
where $x' = [x_1\; x_2\; \cdots\; x_n]$, x is the corresponding column vector with elements
$x_1, x_2, \ldots, x_n$, and A is the matrix $[a_{jk}]$, n × n. This is
easily verified by carrying out the multiplications according to rule. A is
called the matrix of the quadratic form. This matrix is symmetric, since $x_j x_k$
is the same as $x_k x_j$.

In the same way the bilinear form $\sum_{j=1}^{m} \sum_{k=1}^{n} a_{jk} x_j y_k$ may be written as x'Ay,
where x and y are column vectors of m and n rows respectively and A is the
m × n matrix of the bilinear form. If, in the bilinear form x'Ay, we make
two linear transformations x = Hu and y = Kv, the result is a bilinear form
in u and v with matrix H'AK. To prove this, we have

\[ x'Ay = (u'H')A(Kv) = u'(H'AK)v \]

by Theorem 10.1.

If the first p rows of Table 28 are denoted by X and the last row by y', the
matrix of coefficients of the b's in (10.5) may be expressed as A = XX'. The
column vector on the right of (10.5) may be written g = Xy, and the whole
set of normal equations may, therefore, be concisely represented by

\[ Ab = g \tag{10.11} \]

where $b = [b_j]$, p × 1, a column vector.

10.6 The Determinant of a Matrix. If A is a square matrix, of order
n × n, the determinant of A, d(A), is a polynomial of the nth degree in the
elements of A, denoted by

\[ \begin{vmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{vmatrix} \]

or for short by $|a_{jk}|$.

It is assumed that the reader is familiar with the elementary properties of
determinants, as given in most textbooks of college algebra, but we recall a
few of these properties for convenience.

The determinant obtained by omitting the jth row and kth column of d(A)
is called the minor of $a_{jk}$, and will be denoted by $d(A_{jk})$. The signed minor

\[ C_{jk} = (-1)^{j+k} d(A_{jk}) \]

is called the cofactor of $a_{jk}$. The formula for the development of d(A)
according to the jth row is

\[ d(A) = \sum_k a_{jk} C_{jk} \tag{10.12} \]

and similarly for the development according to the kth column,

\[ d(A) = \sum_j a_{jk} C_{jk} \tag{10.13} \]

If in these formulas we replace the cofactors of the jth row (or the kth
column) by the cofactors of a different row or column (what Aitken has called
alien cofactors) the expression on the right-hand side reduces to zero. That is,

\[ \sum_k a_{jk} C_{lk} = 0, \qquad j \ne l \tag{10.14} \]

\[ \sum_j a_{jk} C_{jl} = 0, \qquad k \ne l \tag{10.15} \]

A convenient symbol for expressing such pairs of relations as (10.14) and
(10.12) is the Kronecker delta, $\delta_{jk}$, defined as equal to 0 when j ≠ k and to 1
when j = k. The relations

\[ \sum_k a_{jk} C_{lk} = \delta_{jl}\, d(A) \]

and

\[ \sum_j a_{jk} C_{jl} = \delta_{kl}\, d(A) \]

then express the four equations (10.12) to (10.15).

Even if a matrix is not square, determinants may be constructed from it by 
crossing out rows and/or columns to leave square arrays. All these are 
determinants of the matrix. If the matrix is of order n X n, the largest 
determinant is the determinant of the matrix, and is of order n. 

The rank of a matrix is the order of the determinant (or group of determi- 
nants) of highest order that is not equal to zero. A square matrix of order 
n × n is called singular if its rank is less than n. The determinant of a
singular matrix is, of course, equal to zero.

Multiplication of Determinants. If A and B are two square matrices of
order n × n, and if C = AB, then d(C) = d(A) · d(B). That is, the rule for
multiplying together two determinants of the same order is the same as that
of multiplying two matrices, except, of course, that a determinant may be
transposed without affecting its value and that the order of multiplication is
immaterial. Thus,

\[
\begin{vmatrix} 3 & 1 \\ 2 & 0 \end{vmatrix}
\cdot
\begin{vmatrix} 4 & 1 \\ 2 & 5 \end{vmatrix}
=
\begin{vmatrix} 14 & 8 \\ 8 & 2 \end{vmatrix}
\]

which is clearly true, since the determinants have the values −2, 18, and
−36 respectively.

The matrices AB and BA are, in general, distinct, but both have the same
determinant.
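
The product rule for determinants can be confirmed on the example just given. The sketch below is an illustration only: `det` is a hypothetical helper that develops the determinant by cofactors of the first row, in the manner of (10.12), which is convenient for small matrices but not for large ones.

```python
def det(A):
    """Determinant by development along the first row (cofactor expansion)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for k in range(n):
        # Minor: delete the first row and the kth column.
        minor = [row[:k] + row[k + 1:] for row in A[1:]]
        total += (-1) ** k * A[0][k] * det(minor)
    return total

def matmul(A, B):
    """Matrix product of lists of rows, assuming conformable arguments."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

A = [[3, 1], [2, 0]]
B = [[4, 1], [2, 5]]
print(det(A), det(B), det(matmul(A, B)), det(matmul(B, A)))   # -2 18 -36 -36
```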

10.7 The Inverse of a Matrix. If a matrix A, n × n, is non-singular, there
exists a unique n × n matrix, denoted by $A^{-1}$, such that

\[ AA^{-1} = I \]

$A^{-1}$ is called the inverse or reciprocal of A.

The transpose of the matrix of cofactors of the elements of A is called the
adjoint of A, denoted by adj A. That is,

\[ \operatorname{adj} A = [C_{jk}]' = [C_{kj}] \]

Hence $A \cdot \operatorname{adj} A = [\sum_i a_{ji} C_{ki}] = [\delta_{jk}\, d(A)] = d(A) I$, since the
element of I in the jth row and kth column is $\delta_{jk}$ and d(A) is a scalar. Hence,
provided d(A) is not zero,

\[ A^{-1} = \operatorname{adj} A / d(A) \tag{10.16} \]

It can be proved in the same way that

\[ A^{-1}A = I \]

We have, therefore, an operation of matrix division defined for non-singular
square matrices only. As with multiplication, we may pre-divide or
post-divide B by A, if B and A are conformable, forming $A^{-1}B$ and $BA^{-1}$, which
are in general different.

Theorem 10.2. The reversal rule applies to inversion, namely,

\[ (AB)^{-1} = B^{-1}A^{-1} \tag{10.17} \]

This follows since $(AB)(AB)^{-1} = I = AA^{-1} = ABB^{-1}A^{-1}$.

Theorem 10.3. The operations of transposition and inversion are commutative,
that is,

\[ (A^{-1})' = (A')^{-1} \tag{10.18} \]

For $A'(A^{-1})' = (A^{-1}A)' = I' = I$, so that $(A^{-1})'$ is the inverse of A'.

The operation of inversion provides a solution in concise symbolic form of a
set of simultaneous linear equations, such as the normal equations (10.5),
which in matrix notation are written as in (10.11), Ab = g.

To solve these equations for the unknown b's, we premultiply by $A^{-1}$.
Then $A^{-1}Ab = A^{-1}g$, or

\[ b = A^{-1}g = (\operatorname{adj} A \cdot g)/d(A) \tag{10.19} \]

The elements of $A^{-1}$ are often written as $a^{jk}$. Thus, (10.19) is equivalent to

\[ b_j = \sum_k a^{jk} g_k = \frac{\sum_k C_{kj}\, g_k}{d(A)} \tag{10.20} \]

This is Cramer's rule, named after Gabriel Cramer (1704-1752), a Swiss
mathematician who first stated it. The numerator of the fraction on the
right-hand side of (10.20) is the determinant of the matrix derived from A by
replacing the elements of its jth column by $g_1, g_2, \ldots, g_p$.
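
To make the adjoint formula and Cramer's rule concrete, here is a minimal sketch in exact rational arithmetic. The function names `det`, `cofactor`, `inverse` and `cramer` are hypothetical, and the 3 × 3 symmetric system is invented for the illustration; cofactor expansion is used only because it mirrors the formulas of this section, not because it is efficient.

```python
from fractions import Fraction

def det(A):
    """Determinant by development along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** k * A[0][k] * det([row[:k] + row[k + 1:] for row in A[1:]])
               for k in range(len(A)))

def cofactor(A, j, k):
    """Signed minor C_jk obtained by deleting row j and column k."""
    minor = [row[:k] + row[k + 1:] for r, row in enumerate(A) if r != j]
    return (-1) ** (j + k) * det(minor)

def inverse(A):
    """Inverse as adj A / d(A); element (j, k) of adj A is C_kj."""
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    n = len(A)
    return [[Fraction(cofactor(A, k, j)) / d for k in range(n)] for j in range(n)]

def cramer(A, g):
    """Solve A b = g: b_j is det(A with column j replaced by g) over det(A)."""
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    solution = []
    for j in range(len(A)):
        A_j = [row[:j] + [g[r]] + row[j + 1:] for r, row in enumerate(A)]
        solution.append(Fraction(det(A_j)) / d)
    return solution

# An invented symmetric system, of the kind produced by normal equations.
A = [[4, 2, 0], [2, 3, 1], [0, 1, 2]]
g = [2, 1, 3]

b = cramer(A, g)
A_inv = inverse(A)
b_from_inverse = [sum(A_inv[j][k] * g[k] for k in range(3)) for j in range(3)]
print(b, b_from_inverse)
```

Both routes give the same vector b, and substituting it back into the equations reproduces g exactly because the arithmetic is rational.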

A non-singular matrix A is orthogonal if its transpose is equal to its inverse,
that is, if

\[ AA' = I \]

The matrix $\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$ is orthogonal,
and is the matrix of the orthogonal transformation

\[
\begin{aligned}
x &= x' \cos\theta - y' \sin\theta \\
y &= x' \sin\theta + y' \cos\theta
\end{aligned}
\]

which corresponds geometrically to a rotation of the coordinate axes about the
origin through the angle θ.
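
The orthogonality of the rotation matrix is easily confirmed numerically. The sketch below uses hypothetical helpers `rotation`, `transpose` and `matmul`, and an arbitrarily chosen angle; the product of the matrix with its transpose should be the unit matrix up to rounding error.

```python
import math

def rotation(theta):
    """Matrix of a rotation of the axes through the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

R = rotation(0.7)                      # arbitrary angle for illustration
product = matmul(R, transpose(R))      # should be the 2 x 2 unit matrix
print(product)
```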

The general orthogonal transformation of § 4.13 was defined as

\[ y_i = \sum_j c_{ij} x_j, \qquad i, j = 1, 2, \ldots, n \]

where

\[ \sum_j c_{ij} c_{kj} = \delta_{ik} \]

If C is the matrix of this transformation the orthogonality condition is
equivalent to

\[ CC' = I \]

so that the matrix of any orthogonal transformation is orthogonal.

10.8 Numerical Solution of Normal Equations. The elegant theoretical 
solution by means of Cramer’s rule is not very useful in practice, particularly 
for p > 3. This is because the computation of determinants of high order 
by the usual expansion method does not lend itself to compact self-checking 
numerical schemes. In practice, solutions are usually found by some scheme 
of systematic elimination of the variables, one by one, a method attributed to 
Gauss and modified in the direction of greater compactness by Doolittle, 
Dwyer, and others.

This method transforms the original set of p equations in p unknowns (say
$u_1, u_2, \ldots, u_p$) to a set of p equations containing, respectively, p, p − 1, …, 2, 1
unknowns. Thus

\[
\begin{aligned}
t_{11}u_1 + t_{12}u_2 + \cdots + t_{1,p-1}u_{p-1} + t_{1p}u_p &= h_1 \\
t_{22}u_2 + \cdots + t_{2,p-1}u_{p-1} + t_{2p}u_p &= h_2 \\
\cdots \qquad & \\
t_{p-1,p-1}u_{p-1} + t_{p-1,p}u_p &= h_{p-1} \\
t_{pp}u_p &= h_p
\end{aligned}
\tag{10.21}
\]

These equations are then easily solved, one by one, beginning with the last
and working up. This is known as the "back solution."

The matrix equivalent of (10.21) is

\[ Tu = h \tag{10.22} \]

where T is a triangular matrix, that is, a square matrix with all elements below
(or above) the principal diagonal equal to zero. Written out,

\[
T = \begin{bmatrix}
t_{11} & t_{12} & \cdots & t_{1,p-1} & t_{1p} \\
0 & t_{22} & \cdots & t_{2,p-1} & t_{2p} \\
\vdots & & & & \vdots \\
0 & 0 & \cdots & 0 & t_{pp}
\end{bmatrix}
\]

The first step in the Gauss solution is, therefore, equivalent to transforming
a given square matrix to a triangular matrix.

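One method of doing this is known as the Square Root Method. If A is a
symmetric non-singular matrix, we find a triangular matrix S such that

\[ S'S = A \tag{10.23} \]

In a certain sense, S is a "square root" of A. If the matrix equation to be
solved is Au = g, we have then

\[ S'Su = g \]

This is equivalent to the two equations S'k = g, Su = k, both of which are
triangular and, therefore, readily solved.

Since

\[
S = \begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
0 & s_{22} & \cdots & s_{2p} \\
\vdots & & & \vdots \\
0 & 0 & \cdots & s_{pp}
\end{bmatrix},
\]

(10.23) is equivalent to

\[
\begin{aligned}
s_{11}^2 &= a_{11} \\
s_{11}s_{12} &= a_{12} \\
s_{12}^2 + s_{22}^2 &= a_{22} \\
&\;\;\vdots \\
s_{1j}^2 + s_{2j}^2 + \cdots + s_{jj}^2 &= a_{jj} \\
s_{1j}s_{1k} + s_{2j}s_{2k} + \cdots + s_{jj}s_{jk} &= a_{jk}
\end{aligned}
\]

whence we obtain explicit expressions for the elements of S, namely,

\[
\begin{aligned}
s_{11} &= (a_{11})^{1/2} \\
s_{12} &= a_{12}/s_{11} \\
s_{22} &= (a_{22} - s_{12}^2)^{1/2} \\
&\;\;\vdots \\
s_{jj} &= (a_{jj} - s_{1j}^2 - s_{2j}^2 - \cdots - s_{j-1,j}^2)^{1/2} \\
s_{jk} &= (a_{jk} - s_{1j}s_{1k} - s_{2j}s_{2k} - \cdots - s_{j-1,j}s_{j-1,k})/s_{jj}
\end{aligned}
\tag{10.24}
\]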

The next step is to find k from $S'k = g$, or

$$
\begin{aligned}
s_{11}k_1 &= g_1 \\
s_{12}k_1 + s_{22}k_2 &= g_2 \\
s_{1j}k_1 + s_{2j}k_2 + \cdots + s_{jj}k_j &= g_j
\end{aligned}
$$

whence

(10.25)
$$
\begin{aligned}
k_1 &= g_1/s_{11} \\
k_2 &= (g_2 - s_{12}k_1)/s_{22} \\
k_j &= (g_j - s_{1j}k_1 - s_{2j}k_2 - \cdots - s_{j-1,j}k_{j-1})/s_{jj}
\end{aligned}
$$

Finally, the $u_j$ are found from $Su = k$, or

$$
\begin{aligned}
s_{11}u_1 + s_{12}u_2 + \cdots + s_{1p}u_p &= k_1 \\
s_{22}u_2 + \cdots + s_{2p}u_p &= k_2 \\
&\;\;\vdots \\
s_{pp}u_p &= k_p
\end{aligned}
$$

giving

(10.26)
$$
\begin{aligned}
u_p &= k_p/s_{pp} \\
u_{p-1} &= (k_{p-1} - s_{p-1,p}u_p)/s_{p-1,p-1} \\
u_j &= (k_j - s_{j,j+1}u_{j+1} - s_{j,j+2}u_{j+2} - \cdots - s_{jp}u_p)/s_{jj}
\end{aligned}
$$

Example 3. Solve the set of equations

$$
\begin{bmatrix} 6.86 & 2.56 & 3.39 \\ 2.56 & 8.92 & 1.78 \\ 3.39 & 1.78 & 4.41 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}
=
\begin{bmatrix} 1.98 \\ 2.93 \\ 1.06 \end{bmatrix}
$$

The operations involved in these calculations are well adapted to machine computation. On most types of machine, for example, a quantity like

$$a_{jk} - s_{1j}s_{1k} - s_{2j}s_{2k} - \cdots - s_{j-1,j}s_{j-1,k}$$

may be found in one operation. Square roots may be obtained from Barlow's Table of Squares, Square Roots, etc., or on the machine. The advantages of the method are more evident when the number of unknowns is larger than in this example.

The first step is to calculate the triangular matrix S. Only the result

$$
S = \begin{bmatrix} 2.619 & 0.977 & 1.294 \\ 0 & 2.822 & 0.183 \\ 0 & 0 & 1.644 \end{bmatrix}
$$

need actually be set down. The elements are obtained as follows:

$$
\begin{aligned}
s_{11} &= (6.86)^{1/2} = 2.619 \\
s_{12} &= 2.56/2.619 = 0.977 \\
s_{13} &= 3.39/2.619 = 1.294 \\
s_{22} &= [8.92 - (0.977)^2]^{1/2} = 2.822 \\
s_{23} &= [1.78 - 0.977 \times 1.294]/2.822 = 0.183 \\
s_{33} &= [4.41 - (1.294)^2 - (0.183)^2]^{1/2} = 1.644
\end{aligned}
$$

The k matrix is similarly set down.

$$
\begin{aligned}
k_1 &= 1.98/2.619 = 0.756 \\
k_2 &= [2.93 - 0.977(0.756)]/2.822 = 0.777 \\
k_3 &= [1.06 - 1.294(0.756) - 0.183(0.777)]/1.644 = -0.037
\end{aligned}
$$

Finally

$$
\begin{aligned}
u_3 &= -0.037/1.644 = -0.023 \\
u_2 &= [0.777 - 0.183(-0.023)]/2.822 = 0.277 \\
u_1 &= [0.756 - 0.977(0.277) - 1.294(-0.023)]/2.619 = 0.197
\end{aligned}
$$

Hence u' = [0.197, 0.277, -0.023].
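The three steps of the square root method, (10.24) to (10.26), can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original text: the function names `square_root_factor` and `solve_square_root` are hypothetical, and the code assumes A is positive definite so that every square root is real (the text asks only for a symmetric non-singular matrix).

```python
import math

def square_root_factor(a):
    """Return the upper-triangular S with S'S = A, following (10.24)."""
    p = len(a)
    s = [[0.0] * p for _ in range(p)]
    for j in range(p):
        # s_jj = (a_jj - s_1j^2 - ... - s_{j-1,j}^2)^(1/2)
        d = a[j][j] - sum(s[i][j] ** 2 for i in range(j))
        if d <= 0.0:
            raise ValueError("real square root needs a positive definite matrix")
        s[j][j] = math.sqrt(d)
        for k in range(j + 1, p):
            # s_jk = (a_jk - s_1j s_1k - ... - s_{j-1,j} s_{j-1,k}) / s_jj
            s[j][k] = (a[j][k] - sum(s[i][j] * s[i][k] for i in range(j))) / s[j][j]
    return s

def solve_square_root(a, g):
    """Solve Au = g through S'k = g (10.25) and Su = k (10.26)."""
    p = len(a)
    s = square_root_factor(a)
    k = [0.0] * p
    for j in range(p):  # forward solution, first row to last
        k[j] = (g[j] - sum(s[i][j] * k[i] for i in range(j))) / s[j][j]
    u = [0.0] * p
    for j in reversed(range(p)):  # back solution, last row to first
        u[j] = (k[j] - sum(s[j][i] * u[i] for i in range(j + 1, p))) / s[j][j]
    return s, k, u

# Data of Example 3
A = [[6.86, 2.56, 3.39], [2.56, 8.92, 1.78], [3.39, 1.78, 4.41]]
g = [1.98, 2.93, 1.06]
S, k, u = solve_square_root(A, g)
print([round(x, 3) for x in u])
```

Because the sketch carries full machine precision rather than three decimals, its last printed digit may differ by a unit from the hand computation above.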

Check Sums. It is very desirable in a long computation to have a series of checks as the work proceeds. Such checks are provided by forming a column vector $\bar{g}$ whose elements are the sums of the corresponding rows in A and g. Thus,

(10.27) $\bar{g}_j = a_{j1} + a_{j2} + \cdots + a_{jp} + g_j$

The second and third steps of the computation are repeated with $\bar{g}$ instead of g, giving new vectors $\bar{k}$ and $\bar{u}$. Apart from errors due to the rounding off of decimals, which should not affect more than the last one or two places unless p is large, we should find

(10.28) $\bar{k}_j = s_{jj} + s_{j,j+1} + \cdots + s_{jp} + k_j$

and

(10.29) $\bar{u}_j = u_j + 1$

A convenient tabular form for the square root method, including the checks, is set out in Table 29. The first row of S is calculated, followed by $k_1$ and the check-sum value $\bar{k}_1$. This gives the first check. Each row of S, in turn, has its own check. In calculating the elements of u, $u_p$ is first obtained and then $\bar{u}_p$, followed by $u_{p-1}$ and $\bar{u}_{p-1}$, and so on, until the check of $\bar{u}_1$ with $u_1 + 1$. As a final check, the values $u_1, \dots, u_p$ should be substituted in the normal equations.


Table 29

                                      g        g-bar
    A       6.86    2.56    3.39     1.98     14.79
                    8.92    1.78     2.93     16.19
                            4.41     1.06     10.64

                                      k        k-bar
    S       2.619   0.977   1.294    0.756     5.647
                    2.822   0.183    0.777     3.782
                            1.644   -0.037     1.606

    u'      0.197   0.277  -0.023
    u-bar'  1.197   1.277   0.977


Note that, in writing out the symmetrical matrix A, it is unnecessary to include the elements below the principal diagonal. These must be included, however, in summing to form the column vector of check sums. We can do this, in effect, by summing down the column to each diagonal element and then across. The omitted elements in S are all zeros.

Observe also that the checks all hold except for an occasional single unit in the third decimal place.
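The check-sum idea can be illustrated numerically. The sketch below is an illustration only: the helper `solve` is a hypothetical stand-in (plain Gaussian elimination rather than the square root method), used to show that solving with the row-sum right-hand side of (10.27) yields each unknown increased by exactly 1, as in (10.29).

```python
def solve(a, g):
    """Solve a u = g by Gaussian elimination with partial pivoting."""
    p = len(a)
    m = [row[:] + [gi] for row, gi in zip(a, g)]  # augmented matrix
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for j in range(c, p + 1):
                m[r][j] -= f * m[c][j]
    u = [0.0] * p
    for r in reversed(range(p)):  # back solution
        u[r] = (m[r][p] - sum(m[r][j] * u[j] for j in range(r + 1, p))) / m[r][r]
    return u

A = [[6.86, 2.56, 3.39], [2.56, 8.92, 1.78], [3.39, 1.78, 4.41]]
g = [1.98, 2.93, 1.06]

# (10.27): each check element is the full row sum of A plus the element of g
g_bar = [sum(row) + gi for row, gi in zip(A, g)]

u = solve(A, g)
u_bar = solve(A, g_bar)

# (10.29): the differences should all equal 1 apart from rounding
check = [ub - ui for ub, ui in zip(u_bar, u)]
print(check)
```

The check works because the row sums are A applied to a vector of ones, so the second solution is the first plus that vector of ones.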

In a long calculation it is advisable to retain two or three more significant figures than are desired at the end. Figures can always be dropped, but they cannot be replaced later in the calculation. This applies particularly in Analysis of Variance problems, where the subtraction of two nearly equal numbers may mean the loss of several significant figures in one step.

10.9 Calculation of the Inverse of a Matrix. As we shall see later, some of the elements of the matrix $A^{-1}$ in (10.19) are required for testing the significance of the partial regression coefficients $b_j$. The inverse matrix is used also in dropping a variable from the regression equation if it does not seem to be contributing much information.

When the inverse matrix is required, a good method of solving the normal equation $Au = g$ is to invert A and calculate $u = A^{-1}g$ by matrix multiplication. When only a few elements of $A^{-1}$ are wanted, however, it may be a waste of time to invert the whole matrix.

Since $AA^{-1} = I$, the kth column of $A^{-1}$ is the solution $U_k$ of the matrix equation $AU_k = G_k$, where $G_k$ is a column matrix of zeros, except that the element in the kth row is 1. Thus,

$$
U_1 = \begin{bmatrix} a^{11} \\ a^{21} \\ \vdots \\ a^{p1} \end{bmatrix}
\quad \text{is given by solving the equation} \quad
AU_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$

There are p separate sets of equations to solve, but they all have a common coefficient matrix A, so that a compact solution is possible. The method here suggested (there are many others) consists of three steps:

1. Finding the square root triangular matrix S, as described above.

2. Inverting S (which is much easier than inverting A).

3. Calculating $A^{-1} = (S'S)^{-1} = S^{-1}(S^{-1})'$ by Theorem 10.3.

If $s^{jk}$ is the typical element of $S^{-1}$, it is easy to see from the rules of matrix multiplication as applied to $S^{-1}S = I$, that all the $s^{jk} = 0$ when j > k. Hence $S^{-1}$ is also a triangular matrix. Moreover, the diagonal elements $s^{jj} = 1/s_{jj}$ and when j < k,

(10.30) $s^{jk} = (-s^{jj}s_{jk} - s^{j,j+1}s_{j+1,k} - \cdots - s^{j,k-1}s_{k-1,k})/s_{kk}$

Thus,

$$
\begin{aligned}
s^{11} &= 1/s_{11} \\
s^{21} &= 0 \\
s^{12} &= -s^{11}s_{12}/s_{22} \\
s^{22} &= 1/s_{22}
\end{aligned}
$$

etc.

The simplest way to perform this inversion is probably to write $S^{-1}$ in its transposed form, and to remember that the jth column of $(S^{-1})'$ multiplied by the kth column of S is equal to $\delta_{jk}$. Thus, to invert the matrix S of Example 3, we should have

$$
(S^{-1})' = \begin{bmatrix} s^{11} & 0 & 0 \\ s^{12} & s^{22} & 0 \\ s^{13} & s^{23} & s^{33} \end{bmatrix}
$$

where the elements are given by

$$
\begin{aligned}
s^{11} &= 1/2.619 = 0.382 \\
s^{22} &= 1/2.822 = 0.354 \\
s^{12} &= [0 - 0.977(0.382)]/2.822 = -0.132 \\
s^{33} &= 1/1.644 = 0.608 \\
s^{23} &= [0 - 0.183(0.354)]/1.644 = -0.039 \\
s^{13} &= [0 - 1.294(0.382) - 0.183(-0.132)]/1.644 = -0.286
\end{aligned}
$$

These results are entered as obtained in the appropriate places in $(S^{-1})'$, giving

$$
(S^{-1})' = \begin{bmatrix} 0.382 & 0 & 0 \\ -0.132 & 0.354 & 0 \\ -0.286 & -0.039 & 0.608 \end{bmatrix}
$$

Finally, the (j, k)th element of $A^{-1}$, $a^{jk}$, is found by multiplying together the jth and kth columns of $(S^{-1})'$. Thus,

$$
\begin{aligned}
a^{11} &= (0.382)^2 + (-0.132)^2 + (-0.286)^2 = 0.245 \\
a^{12} &= (-0.132)(0.354) + (-0.286)(-0.039) = -0.036
\end{aligned}
$$

etc.

The final result is

$$
A^{-1} = \begin{bmatrix} 0.245 & -0.036 & -0.174 \\ -0.036 & 0.127 & -0.024 \\ -0.174 & -0.024 & 0.370 \end{bmatrix}
$$

The solution of the equations of Example 3 may now be obtained from equation (10.19), $u = A^{-1}g$, where

$$
g = \begin{bmatrix} 1.98 \\ 2.93 \\ 1.06 \end{bmatrix}
$$

The result is

$$
u = \begin{bmatrix} 0.195 \\ 0.275 \\ -0.023 \end{bmatrix}
$$

which agrees with the solution found before except for errors of rounding-off.
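The three-step inversion (find S, invert S by (10.30), then multiply) can be sketched as follows. This is an illustrative sketch under the assumption that A is positive definite; the function names are hypothetical and do not come from the text.

```python
import math

def upper_square_root(a):
    """Upper-triangular S with S'S = A, as in (10.24)."""
    p = len(a)
    s = [[0.0] * p for _ in range(p)]
    for j in range(p):
        s[j][j] = math.sqrt(a[j][j] - sum(s[i][j] ** 2 for i in range(j)))
        for k in range(j + 1, p):
            s[j][k] = (a[j][k] - sum(s[i][j] * s[i][k] for i in range(j))) / s[j][j]
    return s

def invert_upper(s):
    """Inverse of an upper-triangular matrix, element by element as in (10.30)."""
    p = len(s)
    t = [[0.0] * p for _ in range(p)]
    for j in range(p):
        t[j][j] = 1.0 / s[j][j]
        for k in range(j + 1, p):
            # s^{jk} = -(s^{jj} s_jk + ... + s^{j,k-1} s_{k-1,k}) / s_kk
            t[j][k] = -sum(t[j][i] * s[i][k] for i in range(j, k)) / s[k][k]
    return t

def invert_symmetric(a):
    """A^{-1} = S^{-1} (S^{-1})': element (j, k) is row j times row k of S^{-1}."""
    p = len(a)
    s_inv = invert_upper(upper_square_root(a))
    return [[sum(s_inv[j][i] * s_inv[k][i] for i in range(p)) for k in range(p)]
            for j in range(p)]

A = [[6.86, 2.56, 3.39], [2.56, 8.92, 1.78], [3.39, 1.78, 4.41]]
A_inv = invert_symmetric(A)
for row in A_inv:
    print([round(x, 3) for x in row])
```

Rows of $S^{-1}$ are the columns of $(S^{-1})'$, so the final product here is the same "columns of the transposed inverse" rule used in the hand computation.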

Check Sums. The method of checking the computation, step by step, as described in § 10.8 may be used in calculating $A^{-1}$. The check for the first step consists in forming a column vector a whose elements are the sums of the corresponding rows in A,

(10.31) $a_j = \sum_k a_{jk}, \quad j = 1, 2, \dots, p$

The equation $s'S = a'$ is solved for s' along with $S'S = A$ (this involves only one extra column) and the check is provided by

$$s_j = \sum_k s_{jk}, \quad j = 1, 2, \dots, p$$

except for rounding-off errors. The explicit formula for $s_j$ is

(10.32) $s_j = (a_j - s_{1j}s_1 - s_{2j}s_2 - \cdots - s_{j-1,j}s_{j-1})/s_{jj}$

The check for the second step (inversion of S) consists in forming a row vector i', whose elements are all unity, and computing the row vector t' for which $t'S = i'$, so that

(10.33) $t_k = (1 - s_{1k}t_1 - s_{2k}t_2 - \cdots - s_{k-1,k}t_{k-1})/s_{kk}$

The check is

$$t_k = \sum_j s^{jk}, \quad k = 1, 2, \dots, p$$

If the transposed matrix $(S^{-1})'$ is computed instead of $S^{-1}$ (as suggested above) it is convenient to compute the column vector t instead of t'. The checks are then on the rows of $(S^{-1})'$.

Finally, the third step, $A^{-1} = S^{-1}(S^{-1})'$, may be checked by computing $b' = t'(S^{-1})'$. The checks are provided by

$$b_j = \sum_k a^{jk}, \quad j = 1, 2, \dots, p$$

A compact tabular form for carrying out the whole inversion process, with 
checks, is indicated in Table 30. 


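Table 30. Inversion of Symmetric Matrix

(Diagonal cells hold two entries, written upper / lower.)

    b_1 |  a_11 / a^11    a_12           ...   a_1p          | a_1
    b_2 |  a^12           a_22 / a^22    ...   a_2p          | a_2
    ... |  ...            ...            ...   ...           | ...
    b_p |  a^1p           a^2p           ...   a_pp / a^pp   | a_p

    t_1 |  s_11 / s^11    s_12           ...   s_1p          | s_1
    t_2 |  s^12           s_22 / s^22    ...   s_2p          | s_2
    ... |  ...            ...            ...   ...           | ...
    t_p |  s^1p           s^2p           ...   s_pp / s^pp   | s_p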


The matrices A and $A^{-1}$ are superimposed, but because of the symmetry of both matrices, only the diagonal spaces in the table have two entries. The omitted entries must of course be included in forming the check sums. In the same way S and $(S^{-1})'$ are superimposed, but here the omitted entries are all zeros.

The complete set of calculations required to invert the matrix of Example 3 
is given in Table 31, following the arrangement of Table 30. 

The sums of rows check with the s, t and b columns within one unit in the last decimal place. The last figure of the sum is placed in parentheses after the figure it is supposed to check.

Table 31

(Diagonal cells hold two entries, written upper / lower.)

    0.036(5) |  6.86 / 0.245     2.56             3.39           | 12.81
    0.068(7) | -0.036            8.92 / 0.127     1.78           | 13.26
    0.172(2) | -0.174           -0.024            4.41 / 0.370   |  9.58

    0.382(2) |  2.619 / 0.382    0.977            1.294          | 4.891(0)
    0.222(2) | -0.132            2.822 / 0.354    0.183          | 3.005(5)
    0.283(3) | -0.286           -0.039            1.644 / 0.608  | 1.643(4)

10.10 Moving Decimal Points in Matrix Elements. If the elements of a coefficient matrix vary in size by several orders, as may happen in multiple regression when the independent variables are measured in widely differing units, the computations become difficult to handle. It is desirable to have the elements of the principal diagonal, in particular, between the limits of 0.1 and 10. This is achieved by choosing (by inspection) a diagonal matrix D whose non-zero elements are suitable powers of 10, and transforming A to B (= DAD). The column vector g is transformed to $h = Dg\lambda$, where $\lambda$ is a suitable scalar power of 10 chosen so that the elements of h are of about the same order as those of B. The equation $Bv = h$ is then solved and the solution of the original equation $Au = g$ is given by

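(10.34) $u = Dv\lambda^{-1}$

The inverse of A is obtained from that of B by

(10.35) $A^{-1} = DB^{-1}D$

The proof is as follows:

$$
\begin{aligned}
v = B^{-1}h &= (DAD)^{-1}h \\
&= D^{-1}A^{-1}D^{-1}Dg\lambda \\
&= D^{-1}A^{-1}g\lambda \\
&= D^{-1}u\lambda
\end{aligned}
$$

whence

$$u = Dv\lambda^{-1}$$

The proof that $A^{-1} = DB^{-1}D$ is readily extracted from the above proof.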
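The scaling rule (10.34) can be checked numerically with a short sketch. The coefficient matrix below is the one used in Example 4; the right-hand side `g`, the helper `solve`, and all variable names are assumptions introduced only for illustration.

```python
def solve(a, g):
    """Solve a u = g by Gaussian elimination with partial pivoting."""
    p = len(a)
    m = [row[:] + [gi] for row, gi in zip(a, g)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for j in range(c, p + 1):
                m[r][j] -= f * m[c][j]
    u = [0.0] * p
    for r in reversed(range(p)):
        u[r] = (m[r][p] - sum(m[r][j] * u[j] for j in range(r + 1, p))) / m[r][r]
    return u

A = [[68634.0, 25.61, 338.8], [25.61, 0.0892, 0.178], [338.8, 0.178, 4.41]]
g = [19.8, 0.0293, 0.106]      # illustrative right-hand side (an assumption)
d = [0.01, 10.0, 1.0]          # diagonal of D, powers of 10 chosen by inspection
lam = 10.0                     # scalar lambda

p = len(A)
B = [[d[i] * A[i][j] * d[j] for j in range(p)] for i in range(p)]  # B = DAD
h = [d[i] * g[i] * lam for i in range(p)]                          # h = D g lambda

v = solve(B, h)
u_scaled = [d[i] * v[i] / lam for i in range(p)]  # (10.34): u = D v / lambda
u_direct = solve(A, g)
print(u_scaled)
print(u_direct)
```

The two printed solutions should agree to within rounding, while the scaled system B has all its diagonal elements between 0.1 and 10.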


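Example 4. Given the set of normal equations

$$
\begin{bmatrix} 68{,}634 & 25.61 & 338.8 \\ 25.61 & 0.0892 & 0.178 \\ 338.8 & 0.178 & 4.41 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}
=
\begin{bmatrix} 19.8 \\ 0.0293 \\ 0.106 \end{bmatrix}
$$

we may take

$$
D = \begin{bmatrix} 0.01 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$

Then

$$
B = DAD = \begin{bmatrix} 6.86 & 2.56 & 3.39 \\ 2.56 & 8.92 & 1.78 \\ 3.39 & 1.78 & 4.41 \end{bmatrix}
$$

all numbers being rounded off to three significant figures. We obtain

$$
Dg = \begin{bmatrix} 0.198 \\ 0.293 \\ 0.106 \end{bmatrix}
$$

so that a suitable value of $\lambda$ is 10, giving

$$
h = \begin{bmatrix} 1.98 \\ 2.93 \\ 1.06 \end{bmatrix}
$$

The solution of the transformed equations (carried out in Example 3) is

v' = [0.197, 0.277, -0.023]

Then

$$
u = Dv\lambda^{-1} = \frac{1}{10}
\begin{bmatrix} 0.01 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 0.197 \\ 0.277 \\ -0.023 \end{bmatrix}
$$

so that

u' = [0.000197, 0.277, -0.0023]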


An alternative method of solution (which is really equivalent to the Gaussian elimination method and which avoids the necessity of extracting square roots) is to find a triangular matrix T, with diagonal elements equal to unity, and a diagonal matrix D, such that $T'DT = A$. If the matrix equation to be solved is $Au = g$, we have $T'DTu = g$, which is equivalent to the three equations

$$T'h = g, \quad Dk = h, \quad Tu = k$$

each of which is triangular (or diagonal) in form.

If the matrices T and D are of the forms

$$
T = \begin{bmatrix} 1 & t_{12} & \cdots & t_{1p} \\ 0 & 1 & \cdots & t_{2p} \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix},
\qquad
D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p \end{bmatrix}
$$

the equation $T'DT = A$ is equivalent to the set of relations

$$
\begin{aligned}
d_1 &= a_{11} \\
t_{12}d_1 &= a_{12} \\
t_{13}d_1 &= a_{13} \\
&\;\;\vdots \\
t_{12}^2 d_1 + d_2 &= a_{22} \\
t_{12}t_{13}d_1 + t_{23}d_2 &= a_{23} \\
t_{13}^2 d_1 + t_{23}^2 d_2 + d_3 &= a_{33}
\end{aligned}
$$

The total number of equations, $\tfrac{1}{2}p(p + 1)$, is just sufficient to determine the $\tfrac{1}{2}p(p - 1)$ values of $t_{ij}$ and the p values of $d_i$.

The inverse of the matrix A is then given by

$$A^{-1} = T^{-1}D^{-1}(T^{-1})'$$

where T and D, being triangular and diagonal respectively, are readily inverted.
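The square-root-free factorization can be sketched in the same style as the square root method. The code below is an illustrative sketch, with hypothetical function names, that assumes every $d_j$ turns out non-zero (which holds, for instance, when A is positive definite).

```python
def unit_triangular_factor(a):
    """Return (T, d): T unit upper-triangular, d the diagonal of D, with T'DT = A."""
    p = len(a)
    t = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    d = [0.0] * p
    for j in range(p):
        # d_j = a_jj - sum_i t_ij^2 d_i   (i < j)
        d[j] = a[j][j] - sum(t[i][j] ** 2 * d[i] for i in range(j))
        for k in range(j + 1, p):
            # t_jk d_j = a_jk - sum_i t_ij t_ik d_i   (i < j)
            t[j][k] = (a[j][k] - sum(t[i][j] * t[i][k] * d[i] for i in range(j))) / d[j]
    return t, d

def solve_tdt(a, g):
    """Solve Au = g through T'h = g, Dk = h, Tu = k."""
    p = len(a)
    t, d = unit_triangular_factor(a)
    h = [0.0] * p
    for j in range(p):                      # T'h = g (forward)
        h[j] = g[j] - sum(t[i][j] * h[i] for i in range(j))
    k = [h[j] / d[j] for j in range(p)]     # Dk = h (diagonal)
    u = [0.0] * p
    for j in reversed(range(p)):            # Tu = k (backward)
        u[j] = k[j] - sum(t[j][i] * u[i] for i in range(j + 1, p))
    return u

A = [[6.86, 2.56, 3.39], [2.56, 8.92, 1.78], [3.39, 1.78, 4.41]]
g = [1.98, 2.93, 1.06]
T, d = unit_triangular_factor(A)
u = solve_tdt(A, g)
print([round(x, 3) for x in u])
```

No square roots appear anywhere; the diagonal elements $d_j$ here are the squares of the $s_{jj}$ of the square root method.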

10.11 Non-symmetric Matrix Equations. If M is a square non-symmetric matrix, and $Mu = n$, we may use a preliminary transformation given by Aitken to bring the equation to the symmetric form. This consists in multiplying through on the left by M'. If $M'M = A$ and $M'n = g$, the result is the equation $Au = g$, in which A is now symmetric. Since $A^{-1} = M^{-1}(M')^{-1}$, the inverse of M is given by

(10.36) $M^{-1} = A^{-1}M'$

Two additional steps of matrix multiplication are therefore required for the process of inversion. However, in most practical cases, the matrices that we need to invert are symmetric to start with, although non-symmetric matrices do arise in correlation theory.

Another method of dealing with a non-symmetric matrix is to orthogonalize the rows. That is, given the square non-symmetric matrix M we find a new matrix N of which the first row is the same as that of M, the second row is a linear combination of the first and second rows of M orthogonal to the first row, and so on. If the elements of N are denoted by $n_{ij}$ we have, therefore,

$$n_{11}n_{21} + n_{12}n_{22} + \cdots + n_{1p}n_{2p} = 0$$

or, in general,

$$n_{i1}n_{j1} + n_{i2}n_{j2} + \cdots + n_{ip}n_{jp} = 0, \quad i \ne j$$

It is then readily seen that $NN' = D$, where D is a diagonal matrix. The transformation of M to N is equivalent to multiplying M by a triangular matrix T of which the diagonal elements are all equal to unity. Thus $TM = N$, where

$$
T = \begin{bmatrix} 1 & 0 & 0 & \cdots \\ t_{21} & 1 & 0 & \cdots \\ t_{31} & t_{32} & 1 & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix}
$$

is equivalent to the set of relations

$$
\begin{aligned}
m_{11} &= n_{11}, & m_{12} &= n_{12}, & &\cdots \\
t_{21}m_{11} + m_{21} &= n_{21}, & t_{21}m_{12} + m_{22} &= n_{22}, & &\cdots
\end{aligned}
$$

If, then, we have the equation to solve

$$Mu = n$$

we can write $TM(TM)' = D$, and $(TM)'v = u$, so that

$$
\begin{aligned}
Dv &= TMu = Tn \\
v &= D^{-1}Tn
\end{aligned}
$$

and

$$
\begin{aligned}
u &= (TM)'D^{-1}Tn \\
&= M'T'D^{-1}Tn
\end{aligned}
$$
10.12 Improvement of an Approximation to the Inverse Matrix. It has been pointed out by Hotelling that, owing to the rapid accumulation of rounding-off errors in a long calculation, the solution of a set of p equations in p unknowns with coefficients of the order of unity may possibly be in error by as much as $4^{p-1}$ times the maximum error in the coefficients themselves. The situation is similar in the process of inverting the matrix of coefficients. It is obvious, then, that if p is at all large the number of significant figures in the final result will be seriously cut down. An iteration method of improving a fairly crude approximation is desirable.

Let us denote $A^{-1}$ by C, and let $C_0$ be an approximation to C. Then a better approximation is

(10.37) $C_1 = C_0(2I - AC_0)$

and this approximation may be improved step by step, by calculating at each step

(10.38) $C_{m+1} = C_m(2I - AC_m), \quad m = 0, 1, 2, \dots$

One or two steps will usually be sufficient in practice.

If $D_0 = I - AC_0$, $D_0$ will be a matrix whose elements are all small, assuming that $C_0$ is a reasonably good approximation to $A^{-1}$. The size of a matrix is estimated by its norm, which is the positive square root of the sum of squares of its elements. We now prove that

(10.39) $C_m = A^{-1}[I - (D_0)^{2^m}]$

This is clearly true for m = 0. If it is true for m, it is true for m + 1, since then

$$
\begin{aligned}
C_{m+1} &= C_m(2I - AC_m) \\
&= A^{-1}[I - (D_0)^{2^m}][I + (D_0)^{2^m}] \\
&= A^{-1}[I - (D_0)^{2^{m+1}}]
\end{aligned}
$$

It is, therefore, true for m = 0, 1, 2, .... Hence

(10.40)
$$
\begin{aligned}
C - C_m &= A^{-1}(D_0)^{2^m} \\
&= C_0(I - D_0)^{-1}(D_0)^{2^m}
\end{aligned}
$$

It may be proved from this that if the norm of $D_0$, $N(D_0)$, is equal to k < 1,

(10.41) $N(C - C_m) \le N(C_0)\,k^{2^m}(1 - k)^{-1}$

and so tends to zero as m increases.

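Example 5. Let

$$
A = \begin{bmatrix} 1.0 & 0.4 & 0.5 & 0.6 \\ 0.4 & 1.0 & 0.3 & 0.4 \\ 0.5 & 0.3 & 1.0 & 0.2 \\ 0.6 & 0.4 & 0.2 & 1.0 \end{bmatrix}
$$

Suppose that a first approximation to the inverse of this is

$$
C_0 = \begin{bmatrix} 2.1 & -0.2 & -0.8 & -1.0 \\ -0.2 & 1.3 & -0.2 & -0.4 \\ -0.8 & -0.2 & 1.4 & 0.3 \\ -1.0 & -0.4 & 0.3 & 1.7 \end{bmatrix}
$$

Then

$$
D_0 = I - AC_0 = \begin{bmatrix} -0.02 & 0.02 & 0 & -0.01 \\ 0 & 0 & -0.02 & 0.03 \\ 0.01 & -0.01 & 0 & -0.02 \\ -0.02 & 0.04 & -0.02 & 0 \end{bmatrix}
$$

so that

$$N(D_0) = 0.072$$

The second approximation is

$$
C_1 = C_0(I + D_0) =
\begin{bmatrix} 2.1 & -0.2 & -0.8 & -1.0 \\ -0.2 & 1.3 & -0.2 & -0.4 \\ -0.8 & -0.2 & 1.4 & 0.3 \\ -1.0 & -0.4 & 0.3 & 1.7 \end{bmatrix}
\begin{bmatrix} 0.98 & 0.02 & 0 & -0.01 \\ 0 & 1 & -0.02 & 0.03 \\ 0.01 & -0.01 & 1 & -0.02 \\ -0.02 & 0.04 & -0.02 & 1 \end{bmatrix}
$$

$$
= \begin{bmatrix} 2.070 & -0.190 & -0.776 & -1.011 \\ -0.190 & 1.282 & -0.218 & -0.355 \\ -0.776 & -0.218 & 1.398 & 0.274 \\ -1.011 & -0.355 & 0.274 & 1.692 \end{bmatrix}
$$

which is not in error by more than two units in the third decimal place. One more step gives the result correct within two or three units in the sixth decimal place.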

For further discussion of error control see references 8 and 9. 
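The iteration (10.38) is easy to try on the data of Example 5. The following sketch uses hypothetical helper names (`matmul`, `norm`, `improve`, `residual`) and shows the norm of $I - AC_m$ being roughly squared at each step, as (10.39) implies.

```python
def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def norm(x):
    """Positive square root of the sum of squares of the elements."""
    return sum(v * v for row in x for v in row) ** 0.5

def residual(a, c):
    """D = I - AC for an approximate inverse c."""
    ac = matmul(a, c)
    p = len(a)
    return [[(1.0 if i == j else 0.0) - ac[i][j] for j in range(p)] for i in range(p)]

def improve(a, c):
    """One step of (10.38): C_{m+1} = C_m (2I - A C_m) = C_m (I + D_m)."""
    d = residual(a, c)
    p = len(a)
    factor = [[(1.0 if i == j else 0.0) + d[i][j] for j in range(p)] for i in range(p)]
    return matmul(c, factor)

A = [[1.0, 0.4, 0.5, 0.6], [0.4, 1.0, 0.3, 0.4],
     [0.5, 0.3, 1.0, 0.2], [0.6, 0.4, 0.2, 1.0]]
C0 = [[2.1, -0.2, -0.8, -1.0], [-0.2, 1.3, -0.2, -0.4],
      [-0.8, -0.2, 1.4, 0.3], [-1.0, -0.4, 0.3, 1.7]]

C1 = improve(A, C0)
C2 = improve(A, C1)
for c in (C0, C1, C2):
    print(norm(residual(A, c)))
```

Since $I - AC_{m+1} = (I - AC_m)^2$, each printed norm is bounded by the square of the one before it.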

10.13 Variance and Covariance of the Regression Coefficients in Linear Regression. If we suppose that the $y_\alpha$ of § 10.2 are all independent and are distributed about their respective true values $\eta_\alpha$ with a common variance $\sigma^2$, we can prove that

(10.42) $E(b_j) = \beta_j$

and

(10.43) $\mathrm{Cov}(b_j, b_k) = a^{jk}\sigma^2$

We have, by definition, $g_j = S\,x_{j\alpha}y_\alpha$ and, by the solution of the normal equations, $b_j = \sum_k a^{jk}g_k$. Therefore

$$
\begin{aligned}
E(b_j) &= \sum_k a^{jk}\,S\,x_{k\alpha}\eta_\alpha \\
&= \sum_k a^{jk}\,S\,x_{k\alpha}\sum_i \beta_i x_{i\alpha} \\
&= \sum_k \sum_i a^{jk}a_{ki}\beta_i = \beta_j
\end{aligned}
$$

Also, by hypothesis,

(10.44) $\mathrm{Cov}(y_\alpha, y_\beta) = \delta_{\alpha\beta}\sigma^2$

Accordingly

$$
\begin{aligned}
\mathrm{Cov}(g_j, g_k) &= S_\alpha S_\beta\, x_{j\alpha}x_{k\beta}\,\mathrm{Cov}(y_\alpha, y_\beta) \\
&= \sigma^2\,S\,x_{j\alpha}x_{k\alpha} = \sigma^2 a_{jk}
\end{aligned}
$$

and therefore

$$
\mathrm{Cov}(b_j, b_k) = \sum_l \sum_m a^{jl}a^{km}\,\mathrm{Cov}(g_l, g_m) = \sigma^2 \sum_l \sum_m a^{jl}a_{lm}a^{mk} = \sigma^2 a^{jk}
$$

That is, the matrix $A^{-1}$ multiplied by $\sigma^2$ is the covariance matrix of the coefficients $b_1, b_2, \dots, b_p$. The elements of the principal diagonal give the variance of the respective coefficients. Equation (10.43) is one of the statistician's main reasons for needing to invert A.

The variance of a linear function of the b's may be found from (10.43). In particular, if $Y = \sum_j b_j x_j$ is a predicted value of y based on a new set of values $x_1, x_2, \dots, x_p$ of the independent variates,

$$\mathrm{Var}(Y) = \sum_j \sum_k x_j x_k\,\mathrm{Cov}(b_j, b_k)$$

This is the variance arising out of the uncertainty in the coefficients $b_j$. To get the variance of the observed y which would correspond to the observed $x_1, \dots, x_p$, we must add $\sigma^2$, so that

(10.45) $\sigma_y^2 = \sigma^2\left[1 + \sum_j \sum_k a^{jk}x_j x_k\right]$

This indicates one of the dangers of extrapolation, since $\sigma_y^2$ may become very large for values of $x_1, \dots, x_p$ far outside the range of the original observations.

10.14 Residuals. The difference between the observed and estimated values of the dependent variable y, corresponding to a given set of observations of the x's, is called the residual,

(10.46) $v_\alpha = y_\alpha - Y_\alpha$

It is not the same as the error $\Delta_\alpha$ which is defined by

$$\Delta_\alpha = y_\alpha - \eta_\alpha$$

If v is the column vector with elements $v_1, v_2, \dots, v_N$, and X is the matrix $[x_{j\alpha}]$, p × N, then

$$Xv = X(y - Y) = g - XX'b = g - Ab$$

since $g = Xy$ and $Y = X'b$. But the normal equations give $Ab = g$, so that

(10.47) $Xv = 0$

The residuals are, therefore, said to be orthogonal to each of the predictors.

In the ordinary notation, (10.47) is written

(10.48) $S\,x_{j\alpha}v_\alpha = 0, \quad j = 1, 2, \dots, p$

If, for example, $x_{1\alpha} = 1$ for all $\alpha$, which means that there is a constant term in the regression equation, (10.48) becomes for j = 1

(10.49) $S\,v_\alpha = 0$

This provides a useful check on the residuals.

In matrix notation, $S\,v_\alpha^2$ may be written as v'v, which is a scalar, and we therefore have

$$
\begin{aligned}
S\,v_\alpha^2 &= (y' - Y')v \\
&= y'v - Y'v \\
&= y'v
\end{aligned}
$$

since

$$Y'v = b'Xv = 0, \quad \text{by (10.47)}$$

Hence

$$
\begin{aligned}
v'v &= y'(y - Y) \\
&= y'(y - X'b) \\
&= y'y - g'b
\end{aligned}
$$

which in scalar notation becomes

(10.50) $S\,v_\alpha^2 = S\,y_\alpha^2 - \sum_j b_j g_j$

This is usually a convenient way to compute the sum of squares of residuals, provided enough figures are retained in the $b_j$. There is a danger that in the subtraction almost all the significant figures will be lost. It is worth while actually to compute the separate residuals, in any event, in order to make sure that they do not show any systematic tendencies.

10.15 Distribution of Sum of Squares of Residuals. We now prove that

(10.51) $E(S\,v_\alpha^2) = (N - p)\sigma^2$

so that an unbiased estimate of $\sigma^2$ is provided by

(10.52) $s^2 = S\,v_\alpha^2/(N - p)$

This is not the same as the maximum likelihood estimate, which has N instead of N - p, and is found by maximizing L simultaneously with respect to $\sigma^2, \beta_1, \dots, \beta_p$, assuming that the $\Delta_\alpha$ are independently and normally distributed about zero with common variance $\sigma^2$. The proof of (10.51) does not require the assumption of normality.

$$E(y_\alpha^2) = E\{(\eta_\alpha + \Delta_\alpha)^2\} = \eta_\alpha^2 + \sigma^2$$

so that

$$E(S\,y_\alpha^2) = N\sigma^2 + S\,\eta_\alpha^2$$

Also

$$\sum_j b_j g_j = \sum_{j,k} a_{jk}b_j b_k$$

so that

$$
\begin{aligned}
E\Big(\sum_j b_j g_j\Big) &= \sum_{j,k} a_{jk}E(b_j b_k) \\
&= \sum_{j,k} a_{jk}(a^{jk}\sigma^2 + \beta_j\beta_k), \quad \text{by (10.43)}
\end{aligned}
$$

Since $\sum_{j,k} a_{jk}a^{jk} = p$ and $\sum_{j,k} a_{jk}\beta_j\beta_k = S\,\eta_\alpha^2$, we have on substituting in (10.50),

$$E(S\,v_\alpha^2) = (N - p)\sigma^2$$

The estimate in (10.52) is usually preferred to the maximum likelihood estimate, not only because it is unbiased, but chiefly because its distribution on the assumption of normality is the same as that of the sample variance, discussed in Chapter VII, with n = N - p degrees of freedom. That is, $(N - p)s^2/\sigma^2$ has the $\chi^2$ distribution with N - p degrees of freedom. Moreover, with the same assumption, the distribution of $s^2$ is quite independent of that of the b's (which are normal variates), so that we can use Student's t-distribution to fix confidence limits for the $\beta$'s.

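We first prove that $\mathrm{Cov}(v_\alpha, b_j) = 0$ for all $\alpha$ and all j. Since

$$
\begin{aligned}
v_\alpha &= y_\alpha - Y_\alpha \\
&= \Delta_\alpha + \eta_\alpha - Y_\alpha
\end{aligned}
$$

(10.53) $v_\alpha = \Delta_\alpha - \sum_j (b_j - \beta_j)x_{j\alpha}$

and since $E(b_j - \beta_j) = 0$, we have

(10.54) $E(v_\alpha) = 0$

Also

(10.55)
$$
\begin{aligned}
\mathrm{Cov}(v_\alpha, b_j) &= E\{v_\alpha(b_j - \beta_j)\} \\
&= E\{\Delta_\alpha(b_j - \beta_j)\} - E\Big\{\sum_k (b_k - \beta_k)x_{k\alpha}(b_j - \beta_j)\Big\}
\end{aligned}
$$

Now

$$
\begin{aligned}
E\{\Delta_\alpha(b_j - \beta_j)\} &= E(\Delta_\alpha b_j) \\
&= E\Big(\Delta_\alpha \sum_k a^{jk}g_k\Big) \\
&= E\Big\{\Delta_\alpha \sum_k a^{jk}\,S_\beta\,x_{k\beta}(\eta_\beta + \Delta_\beta)\Big\} \\
&= \sum_k a^{jk}\,S_\beta\,x_{k\beta}E(\Delta_\alpha\Delta_\beta) \\
&= \sum_k a^{jk}x_{k\alpha}\sigma^2
\end{aligned}
$$

by (10.44). Also

$$E[(b_j - \beta_j)(b_k - \beta_k)] = \mathrm{Cov}(b_j, b_k) = a^{jk}\sigma^2$$

Substituting these results in (10.55), we obtain

(10.56) $\mathrm{Cov}(v_\alpha, b_j) = 0$

This, of course, proves only that the residuals and the regression coefficients are uncorrelated, not that they are independent. But if the $\Delta_\alpha$ have a multivariate normal distribution, so have the $y_\alpha$, which differ from the $\Delta_\alpha$ only by constants. The $g_j$ are linear functions of the $y_\alpha$, the $b_j$ are linear functions of the $g_j$, and the $v_\alpha$ are linear functions of the $y_\alpha$ and the $b_j$. Hence the $v_\alpha$ and the $b_j$ are expressible as linear functions of the $\Delta_\alpha$, and so they too have a joint multivariate normal distribution. The absence of correlation for this distribution implies that the joint frequency function can be split up into a factor depending only on the $v_\alpha$ and a factor depending only on the $b_j$. In other words the b's and the v's are independent.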

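From (10.53) and (10.48)

$$
\begin{aligned}
S\,v_\alpha^2 &= S\,\Delta_\alpha^2 - \sum_{j,k}(b_j - \beta_j)(b_k - \beta_k)\,S\,x_{j\alpha}x_{k\alpha} \\
&= S\,\Delta_\alpha^2 - \sum_{j,k}(b_j - \beta_j)(b_k - \beta_k)a_{jk} \\
&= S\,\Delta_\alpha^2 - \sum_{j,k} a_{jk}u_j u_k
\end{aligned}
$$

where $u_j = b_j - \beta_j$. Let us now make a linear transformation of the variables u to new variables w, chosen so that the quadratic form $\sum a_{jk}u_j u_k$ reduces to a sum of squares. That is, if

(10.57) $w_k = \sum_l \lambda_{kl}u_l$

then

(10.58) $S\,v_\alpha^2 = S\,\Delta_\alpha^2 - \sum_k w_k^2$

Since $\mathrm{Cov}(u_l, u_m) = a^{lm}\sigma^2$, we have

(10.59) $\mathrm{Var}(w_k) = \sigma^2 \sum_l \sum_m \lambda_{kl}\lambda_{km}a^{lm}$

Now if we actually carry out the transformation (10.57), the first step is to put

$$w_1 = (a_{11})^{-1/2}\sum_j a_{1j}u_j$$

for then

$$
\begin{aligned}
w_1^2 &= (a_{11})^{-1}\sum_j \sum_k a_{1j}a_{1k}u_j u_k \\
&= (a_{11})^{-1}\Big[a_{11}^2 u_1^2 + 2a_{11}\sum_{j=2}^p a_{1j}u_1 u_j + \sum_{j,k=2}^p a_{1j}a_{1k}u_j u_k\Big]
\end{aligned}
$$

Hence

$$\sum_{j,k=1}^p a_{jk}u_j u_k = w_1^2 + \sum_{j,k=2}^p b_{jk}u_j u_k$$

where $b_{jk} = (a_{11}a_{jk} - a_{1j}a_{1k})/a_{11}$. In exactly the same way

$$\sum_{j,k=2}^p b_{jk}u_j u_k = w_2^2 + \sum_{j,k=3}^p c_{jk}u_j u_k$$

where $w_2 = (b_{22})^{-1/2}\sum_{j=2}^p b_{2j}u_j$, and so on. Hence, in (10.57),

$$
\begin{aligned}
\lambda_{1j} &= a_{1j}/(a_{11})^{1/2} \\
\lambda_{2j} &= b_{2j}/(b_{22})^{1/2}, \text{ etc.}
\end{aligned}
$$

where

$$
b_{22} = \frac{1}{a_{11}}\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}, \text{ etc.}
$$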


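If $D_r$ is the determinant of the first r rows and columns of A, so that

$$
D_1 = a_{11}, \qquad D_2 = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}, \text{ etc.}
$$

we have $b_{22} = D_2/D_1$, $c_{33} = D_3/D_2$, etc. From (10.59),

$$\mathrm{Var}(w_1) = \sigma^2\sum_{l,m} a_{1l}a_{1m}a^{lm}/a_{11} = \sigma^2$$

$$
\begin{aligned}
\mathrm{Var}(w_2) &= \sigma^2\sum_{l,m} b_{2l}b_{2m}a^{lm}/b_{22} \\
&= \frac{\sigma^2}{D_1 D_2}\sum_{l,m} a^{lm}(a_{11}a_{2l} - a_{12}a_{1l})(a_{11}a_{2m} - a_{12}a_{1m}) \\
&= \frac{\sigma^2}{D_1 D_2}\sum_{l,m} a^{lm}(a_{11}^2 a_{2l}a_{2m} - a_{11}a_{12}a_{2l}a_{1m} - a_{11}a_{12}a_{1l}a_{2m} + a_{12}^2 a_{1l}a_{1m}) \\
&= \frac{\sigma^2}{D_1 D_2}(a_{11}^2 a_{22} - a_{11}a_{12}^2 - a_{11}a_{12}^2 + a_{12}^2 a_{11}) \\
&= \sigma^2(a_{11}a_{22} - a_{12}^2)/D_2 = \sigma^2
\end{aligned}
$$

In the same way, each of the new variables may be shown to have the same variance $\sigma^2$.

Also, being linear functions of the b's, they are independent of the $v_\alpha$. We have, therefore,

$$S(v_\alpha^2/\sigma^2) = S(\Delta_\alpha^2/\sigma^2) - \sum_k (w_k^2/\sigma^2)$$

where the terms on the right are the sums of squares of N and of p independent normal standard variates respectively, and are therefore distributed as $\chi^2$, with N and p df. It follows that $S(v_\alpha^2/\sigma^2)$ is distributed as $\chi^2$ with N - p df.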

10.16 Confidence Limits for the True Regression Coefficients. Since $b_j$ is normally distributed about $\beta_j$ with variance $a^{jj}\sigma^2$, and since $s^2$ is an unbiased estimate of $\sigma^2$ with a $\chi^2$ distribution, independent of the $b_j$, it follows that

(10.60) $(b_j - \beta_j)/[s(a^{jj})^{1/2}]$

has Student's t-distribution with N - p df. If, therefore, $t_\alpha$ is the value of t corresponding to the confidence coefficient $1 - \alpha$, the confidence limits for $\beta_j$ are given by

(10.61) $b_j - s(a^{jj})^{1/2}t_\alpha < \beta_j < b_j + s(a^{jj})^{1/2}t_\alpha$

If $b_1$ and $b_2$ are two of the regression coefficients, the variance of the difference is given by

(10.62) $\mathrm{Var}(b_1 - b_2) = \mathrm{Var}(b_1) + \mathrm{Var}(b_2) - 2\,\mathrm{Cov}(b_1, b_2)$

Hence on the null hypothesis that $\beta_1 = \beta_2$ the quantity

(10.63) $\dfrac{b_1 - b_2}{s(a^{11} + a^{22} - 2a^{12})^{1/2}}$

has the Student distribution with N - p df, and so may be used to test whether the two coefficients differ significantly.

Sometimes the same set of predictors may be used for different sets of y's. An example is given in Fisher's Statistical Methods for Research Workers (10th Edition, p. 136), where the yields of grain from two adjacent plots of land, differently treated with fertilizers, are compared over a period of thirty years. If the yields are estimated by $Y = a + bx$ and $Y' = a' + b'x$, where x represents time, the question is whether $b - b'$ is significantly different from zero. Owing to the strong correlation between Y and Y' we do not compute the values of b and b' separately. Instead we take $Y'' = Y - Y' = a'' + b''x$, and test the significance of b'' from this third regression equation.

If it is necessary in a multiple regression problem to calculate two equations, for different sets of y's, a good procedure would be to calculate $A^{-1}$ and then use the relations

$$b_j = \sum_k a^{jk}g_k, \qquad b_j' = \sum_k a^{jk}g_k'$$

where $g_k$ and $g_k'$ correspond to the two sets y and y' respectively.

10.17 Omission of Variates in Multiple Regression. Suppose that we have found the regression of y on the variables $x_1, x_2, \dots, x_p$ and that we would like to drop $x_1$, since it does not seem to contribute significantly to the regression. That is, we know the $b_j$ as given by the normal equations

$$\sum_{j=1}^p b_j a_{jk} = g_k, \quad k = 1, 2, \dots, p$$

and require the $b_j'$ given by

$$\sum_{j=2}^p b_j' a_{jk} = g_k, \quad k = 2, 3, \dots, p$$

If we let $\delta b_j = b_j - b_j'$, we have on subtraction

(10.64) $b_1 a_{1k} + \sum_{j=2}^p \delta b_j\,a_{jk} = 0, \quad k = 2, 3, \dots, p$

But if $a^{jk}$ is the typical element of $A^{-1}$, we know that

(10.65) $\sum_{j=1}^p a^{1j}a_{jk} = 0, \quad k = 2, 3, \dots, p$

Hence the coefficients of $a_{1k}, a_{2k}, \dots, a_{pk}$ in (10.64) are proportional to those in (10.65), so that

$$\delta b_j/b_1 = a^{1j}/a^{11}$$

or

(10.66) $\delta b_j = a^{1j}b_1/a^{11}$

Hence, to drop the variable $x_1$, we reduce each $b_j$ by $a^{1j}b_1/a^{11}$. If the matrix A has been inverted these corrections are easily obtained.

The variance of the new coefficients $b_j'$ is given by

(10.67)
$$
\begin{aligned}
\mathrm{Var}(b_j') &= \mathrm{Var}(b_j - a^{1j}b_1/a^{11}) \\
&= \mathrm{Var}(b_j) + (a^{1j}/a^{11})^2\,\mathrm{Var}(b_1) - 2(a^{1j}/a^{11})\,\mathrm{Cov}(b_1, b_j)
\end{aligned}
$$

In general the (j, k)th element of the new inverted matrix is

$$(a^{jk})' = a^{jk} - a^{1j}a^{1k}/a^{11}$$
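The correction rule (10.66) and the formula for the new inverse can be verified numerically. The sketch below is illustrative only: it reuses the matrix and right-hand side of Example 3 as if they were normal equations, and the helper names `inverse`, `matvec` and `drop_first` are hypothetical.

```python
def inverse(a):
    """Gauss-Jordan inverse with partial pivoting."""
    p = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(p)] for i, row in enumerate(a)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        pv = m[c][c]
        m[c] = [x / pv for x in m[c]]
        for r in range(p):
            if r != c:
                f = m[r][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [row[p:] for row in m]

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def drop_first(a_inv, b):
    """Apply (10.66) and the new-inverse formula after dropping the first variate."""
    p = len(b)
    ratio = [a_inv[0][j] / a_inv[0][0] for j in range(p)]
    b_new = [b[j] - ratio[j] * b[0] for j in range(1, p)]
    inv_new = [[a_inv[j][k] - a_inv[0][j] * a_inv[0][k] / a_inv[0][0]
                for k in range(1, p)] for j in range(1, p)]
    return b_new, inv_new

A = [[6.86, 2.56, 3.39], [2.56, 8.92, 1.78], [3.39, 1.78, 4.41]]
g = [1.98, 2.93, 1.06]

A_inv = inverse(A)
b = matvec(A_inv, g)                       # full regression coefficients
b_new, inv_new = drop_first(A_inv, b)      # coefficients with x_1 dropped

# Direct recomputation from the reduced normal equations, for comparison
A_sub = [row[1:] for row in A[1:]]
inv_direct = inverse(A_sub)
b_direct = matvec(inv_direct, g[1:])
print(b_new, b_direct)
```

The corrected coefficients agree with a fresh solution of the reduced equations, so the full matrix never has to be inverted a second time.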

10.18 Solution of a Set of Linear Equations with More Equations than Unknowns. One of the oldest applications of least squares methods is to problems of surveying, where various check measurements are made and the results adjusted by calculation. Thus, if O, A, B, C, D (Figure 31) represent survey stations, and the angles $\theta_1, \theta_2, \theta_3$ between OA, OB, OC and OD are required, the surveyor at O can take six different angular measurements, as shown in Table 32.

[Fig. 31]

Table 32

    Observation   Stations   Measured Angle   Adjusted Angle (seconds)
         1          A, B      62°59'40.3"            40.35
         2          A, C      64°11'35.0"            34.40
         3          A, D     100°20'29.1"            29.65
         4          B, C       1°11'54.0"            54.05
         5          B, D      37°20'49.3"            49.30
         6          C, D      36° 8'55.8"            55.25

Let the corrections to be added to these measured angles be denoted by $\Delta_1, \Delta_2, \dots, \Delta_6$. Then, if the measured angles are $\alpha_1, \alpha_2, \dots, \alpha_6$, we have a set of observation equations:

(10.68)
$$
\begin{aligned}
\theta_1 &= \alpha_1 + \Delta_1 \\
\theta_1 + \theta_2 &= \alpha_2 + \Delta_2 \\
\theta_1 + \theta_2 + \theta_3 &= \alpha_3 + \Delta_3 \\
\theta_2 &= \alpha_4 + \Delta_4 \\
\theta_2 + \theta_3 &= \alpha_5 + \Delta_5 \\
\theta_3 &= \alpha_6 + \Delta_6
\end{aligned}
$$

It is obvious by examining the data for consistency that the errors are only a few tenths of a second in magnitude. If we write

$$
\begin{aligned}
\theta_1 &= 62°59'40'' + x \\
\theta_2 &= 1°11'54'' + y \\
\theta_3 &= 36°8'55'' + z
\end{aligned}
$$

the equations for x, y and z (measured in tenths of seconds) will be

(10.69)
$$
\begin{aligned}
x &= 3 + \Delta_1 \\
x + y &= 10 + \Delta_2 \\
x + y + z &= 1 + \Delta_3 \\
y &= 0 + \Delta_4 \\
y + z &= 3 + \Delta_5 \\
z &= 8 + \Delta_6
\end{aligned}
$$

The general pattern of such a set of equations is, therefore,

(10.70) $\sum_{j=1}^n c_{j\alpha}x_j = k_\alpha + \Delta_\alpha, \quad \alpha = 1, 2, \dots, m$

where we have m equations in n unknowns (m > n), and where the $c_{j\alpha}$ and the $k_\alpha$ are known constants. The $x_j$ are adjusted so as to make $S\,\Delta_\alpha^2$ a minimum.

Writing $S\,\Delta_\alpha^2 = S(\sum_j c_{j\alpha}x_j - k_\alpha)^2$, differentiating with respect to $x_j$, and equating the derivatives to zero, we get for the estimated x,

$$S\,c_{j\alpha}\Big(\sum_k c_{k\alpha}x_k - k_\alpha\Big) = 0, \quad j = 1, 2, \dots, n$$

Writing $a_{jk} = S\,c_{j\alpha}c_{k\alpha}$, $g_j = S\,c_{j\alpha}k_\alpha$, we arrive at the normal equations:

(10.71) $\sum_k a_{jk}x_k = g_j, \quad j = 1, 2, \dots, n$

If $A = [a_{jk}]$, n × n, $x = [x_j]$, n × 1, and $g = [g_j]$, n × 1, these may be written as a matrix equation $Ax = g$ with the solution $x = A^{-1}g$. Moreover, if the $\Delta_\alpha$ have a common variance $\sigma^2$ and are independent, $A^{-1}\sigma^2$ is the variance-covariance matrix of the $x_j$, so that, for example, the standard error of $x_j$ is $\sigma(a^{jj})^{1/2}$. As before, $\sigma^2$ is estimated from the residuals by the equation

(10.72) $s^2 = S\,v_\alpha^2/(m - n)$

where

(10.73) $v_\alpha = k_\alpha - \sum_j c_{j\alpha}x_j$

Also

(10.74) $S\,c_{k\alpha}v_\alpha = g_k - \sum_j a_{kj}x_j = 0$

by (10.71), and so

$$
\begin{aligned}
S\,v_\alpha^2 &= S\,v_\alpha\Big(k_\alpha - \sum_j c_{j\alpha}x_j\Big) \\
&= S\,v_\alpha k_\alpha, \quad \text{by (10.74)}
\end{aligned}
$$

(10.75) $\qquad = S\,k_\alpha^2 - \sum_j g_j x_j$

In the above example the $c_{j\alpha}$ are all either 1 or 0. The normal equations are

$$
\begin{aligned}
3x + 2y + z &= 14 \\
2x + 4y + 2z &= 14 \\
x + 2y + 3z &= 12
\end{aligned}
$$

which have the solution x = 3.5, y = 0.5, z = 2.5.

The residuals are -0.5, 6.0, -5.5, -0.5, 0, and 5.5, so that $S\,v_\alpha^2 = 97.0$. As a check, $S\,k_\alpha^2 = 183$ and $\sum g_j x_j = 14 \times 3.5 + 14 \times 0.5 + 12 \times 2.5 = 86$. Hence our estimate of $\sigma^2$ is 97/3 = 32.3. The diagonal elements of $A^{-1}$ are 0.5, 0.5 and 0.5, so that the standard errors of x, y and z are each $[(32.3)/2]^{1/2} = 4.0$ (in tenths of a second of arc). The best values for $\theta_1, \theta_2, \theta_3$ are, therefore, 62°59'40.35" ± 0.40", 1°11'54.05" ± 0.40", 36°8'55.25" ± 0.40".
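The whole adjustment of the surveying example can be reproduced with a short sketch. The helper `inverse` and the variable names are hypothetical; the coefficient rows and constants are those of the observation equations (10.69), in tenths of a second.

```python
def inverse(a):
    """Gauss-Jordan inverse with partial pivoting."""
    p = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(p)] for i, row in enumerate(a)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        pv = m[c][c]
        m[c] = [x / pv for x in m[c]]
        for r in range(p):
            if r != c:
                f = m[r][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [row[p:] for row in m]

# One row per observation equation: coefficients of x, y, z
C = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 1]]
k = [3.0, 10.0, 1.0, 0.0, 3.0, 8.0]
m_obs, n_unk = len(C), len(C[0])

# Normal equations (10.71): a_jk = S c_j c_k, g_j = S c_j k
A = [[float(sum(C[r][i] * C[r][j] for r in range(m_obs))) for j in range(n_unk)]
     for i in range(n_unk)]
g = [sum(C[r][j] * k[r] for r in range(m_obs)) for j in range(n_unk)]

A_inv = inverse(A)
x = [sum(A_inv[j][i] * g[i] for i in range(n_unk)) for j in range(n_unk)]

# Residuals (10.73), their sum of squares, and the estimate (10.72)
v = [k[r] - sum(C[r][j] * x[j] for j in range(n_unk)) for r in range(m_obs)]
ss = sum(e * e for e in v)
s2 = ss / (m_obs - n_unk)
std_err = [(s2 * A_inv[j][j]) ** 0.5 for j in range(n_unk)]

print(x, ss, std_err)
```

The sum of squares can also be obtained from (10.75) as $S\,k_\alpha^2 - \sum g_j x_j$, which gives an independent check on the residuals.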

10.19 Weighted Observations. An observer will sometimes allot different 
weights to different observations, the weight being an estimate of the precision 
of measurement. Thus, if a particular measurement is the mean of four 
readings of equal accuracy, the variance of the mean would be one quarter 
that of a single reading, and the mean might therefore be given a weight of 4 
as compared with a weight of 1 for a single reading. The precision is here 
regarded as inversely proportional to the variance. 

If the observations in (10.70) have weights $w_\alpha$, we may regard each equation as equivalent to $w_\alpha$ identical equations. Alternatively, we may multiply each of the observation equations by the square root of the weight and treat the new coefficients just as we did the old ones. The normal equations remain unchanged in form, but now

$$
\begin{aligned}
a_{jk} &= S\,w_\alpha c_{j\alpha}c_{k\alpha} \\
g_j &= S\,w_\alpha c_{j\alpha}k_\alpha
\end{aligned}
$$

The residuals now satisfy the relations

$$S\,w_\alpha c_{k\alpha}v_\alpha = 0$$

$$S\,w_\alpha v_\alpha^2 = S\,w_\alpha k_\alpha^2 - \sum_j g_j x_j$$

The estimate of $\sigma^2$ is $m\,S(w_\alpha v_\alpha^2)/[(m - n)\,S\,w_\alpha]$, which reduces to the value in (10.72) when all the $w_\alpha$ are equal.

10.20 Condition Equations or Equations of Constraint. In the solution of
a set of observation equations there may be one or more exact equations which
must be satisfied by the adjusted values. Thus, if the four angles of a
quadrilateral are measured, it is known that the true values must add up to exactly
360°. If the measured angles are $\alpha_1, \alpha_2, \alpha_3, \alpha_4$, with weights $w_1, w_2, w_3, w_4$,
the corrections $x_1, \ldots, x_4$ to be added are given by the observation equations
$x_1 = 0$, $x_2 = 0$, $x_3 = 0$, $x_4 = 0$, with the condition equation

$x_1 + x_2 + x_3 + x_4 = 360^\circ - (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) = e$, say.

The condition equation is used to eliminate one of the variables, say $x_4$, and
we then have $x_1 = 0$, $x_2 = 0$, $x_3 = 0$, $x_1 + x_2 + x_3 = e$, with weights $w_1, w_2, w_3, w_4$
respectively. The normal equations are

(10.76)
$(w_1 + w_4)x_1 + w_4 x_2 + w_4 x_3 = w_4 e$
$w_4 x_1 + (w_4 + w_2)x_2 + w_4 x_3 = w_4 e$
$w_4 x_1 + w_4 x_2 + (w_4 + w_3)x_3 = w_4 e$

whence $x_i = (e/w_i)/\sum(1/w_i)$ for i = 1, 2, 3, $\sum$ being over all values of i
from 1 to 4. Then $x_4$ is given by the equation of condition as $(e/w_4)/\sum(1/w_i)$.
The standard errors of $x_1, x_2, x_3$ are found by the usual rule. That of $x_4$ is
obvious from symmetry, but may be found by repeating the calculation with
one of the other variables eliminated instead of $x_4$.

In the general case, if we require to minimize $S w_\alpha(\sum_j c_{j\alpha} x_j - k_\alpha)^2$ subject to
the q linear constraints (q < m) expressed by

(10.77)  $\sum_j b_{ij} x_j = d_i$,  i = 1, 2, ..., q

the procedure is to use Lagrange multipliers $\lambda_1, \lambda_2, \ldots, \lambda_q$, and minimize
(without constraints) the quantity

$\tfrac{1}{2} S w_\alpha(\sum_j c_{j\alpha} x_j - k_\alpha)^2 + \sum_i \lambda_i(\sum_j b_{ij} x_j - d_i)$

The $\tfrac{1}{2}$ is introduced simply as a matter of convenience, to avoid the occurrence
of a factor 2 in the normal equations. We thus obtain the equations

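(10.78)  $\sum_k a_{jk} x_k + \sum_i \lambda_i b_{ij} = g_j$,  j = 1, 2, ..., n

where

$a_{jk} = S w_\alpha c_{j\alpha} c_{k\alpha}$,  $g_j = S w_\alpha c_{j\alpha} k_\alpha$

Equations (10.77) and (10.78) together give n + q equations for the n + q
unknowns $x_k$ and $\lambda_i$. The matrix of coefficients is

$$\begin{bmatrix}
a_{11} & \cdots & a_{1n} & b_{11} & \cdots & b_{q1} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{n1} & \cdots & a_{nn} & b_{1n} & \cdots & b_{qn} \\
b_{11} & \cdots & b_{1n} & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
b_{q1} & \cdots & b_{qn} & 0 & \cdots & 0
\end{bmatrix}$$

In the example of the quadrilateral given above, the set of equations will be

(10.79)
$w_1 x_1 + \lambda = 0$
$w_2 x_2 + \lambda = 0$
$w_3 x_3 + \lambda = 0$
$w_4 x_4 + \lambda = 0$
$x_1 + x_2 + x_3 + x_4 = e$

The solution is as given above, with $\lambda = -e/\sum(1/w_i)$, and the standard
errors are easily found from the matrix of coefficients, which has five rows and
columns.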
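
The closed-form adjustment of the quadrilateral can be sketched in Python as follows. This is an illustration only: the function name `adjust_closure` is hypothetical, and the weights 3, 2, 2, 1 and the closure error e = -58 seconds are assumptions taken from a reading of Problem 13 below (whose four angles sum to 360°0'58").

```python
from fractions import Fraction

def adjust_closure(weights, closure):
    """Corrections x_i = (e / w_i) / sum(1 / w_j) for the single condition sum(x_i) = e."""
    total = sum(Fraction(1) / w for w in weights)
    return [Fraction(closure) / w / total for w in weights]

weights = [3, 2, 2, 1]      # assumed weights w_1..w_4
closure = -58               # assumed e, in seconds of arc
x = adjust_closure(weights, closure)

# The Lagrange multiplier implied by w_i * x_i + lambda = 0 is the same for every i.
multipliers = {-w * xi for w, xi in zip(weights, x)}

# Left-hand side of the first normal equation of (10.76); it should equal w_4 * e.
w1, w2, w3, w4 = weights
first_normal_lhs = (w1 + w4) * x[0] + w4 * x[1] + w4 * x[2]

print([float(v) for v in x], sum(x), multipliers, first_normal_lhs)
```

The same corrections satisfy both the reduced normal equations (10.76) and the bordered system (10.79), which is why a single multiplier value appears.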

Problems

1. Prove that the derivative of d(A) with respect to an element $a_{jk}$ is the co-factor of $a_{jk}$.


2. If the elements of the determinant $\begin{vmatrix} a & b \\ c & d \end{vmatrix}$ are differentiable functions of x, prove
that the derivative of the determinant with respect to x is equal to

$\begin{vmatrix} a' & b \\ c' & d \end{vmatrix} + \begin{vmatrix} a & b' \\ c & d' \end{vmatrix}$

where a' = da/dx, etc. Generalize this result for the n-rowed determinant $|a_{jk}|$.

3. Prove that the determinant of a skew-symmetric matrix of odd order is equal to zero. 

4. Verify by computation that the matrix C of the linear orthogonal transformation
(4.82) satisfies the equation CC' = I.


5. Compute AB and BA, if


L O 0 4 0 
0 0 0 2 J 

6. Calculate the inverse of the matrix 


0 0 0 

1 0 0 

2 10 

4 0 1. 


$A = \begin{bmatrix} 1 & 2 & -2 \\ -1 & 3 & 0 \\ 0 & -2 & 1 \end{bmatrix}$

Check your answer by computing $AA^{-1}$.
7. Solve the normal equations

$\begin{bmatrix} 1.000 & 0.313 & 0.280 \\ 0.313 & 1.000 & 0.652 \\ 0.280 & 0.652 & 1.000 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 0.495 \\ 0.650 \\ 0.803 \end{bmatrix}$

Ans. u' = [0.271, 0.158, 0.625].

8. Compute the inverse matrix of coefficients and solve the system. (This problem and
the answers are due to D. B. De Lury.)

$575.88u_1 + 227.28u_2 + 429.26u_3 = 1600$
$227.28u_1 + 781.10u_2 + 1683.56u_3 = 19100$
$429.24u_1 + 1683.56u_2 + 10962.36u_3 = 2400$

Ans.

$A^{-1} = \begin{bmatrix} 19.3370 & -6.0638 & 0.1624 \\ -6.0644 & 21.0097 & -2.9892 \\ 0.1624 & -2.9891 & 1.3649 \end{bmatrix} \times 10^{-4}$

u' = [-8.4020, 38.4408, -5.3557]

9. The quantities a, b, c, d are to be determined from the following measurements, all of
equal weight: a - b = 1168, a - c = 1877, b - c = 712, c - d = 669, b - d = 1377,
a - d = 2547, d = 165. Find by the method of least squares the best values for a, b, c, d
and their standard errors.

10. The means of three sets of measurements, one on each of the angles of a plane
triangle, are respectively 45°13'5" ± 5.3", 39°17'10" ± 8.4", 95°29'32" ± 11.6". Find
the best values to assume for these angles. The stated errors may be taken as standard
errors, and the means are to be weighted inversely as the variances.

11. Solve the following set of normal equations by the square root method: 


Mx + ISy - 502 - I2u + \7v -693.7 

18x + 85y - 45z + 3u - 15v = -812.7

-50x - 45y + 84z + 43u - 26v = 2376.0
-12x + 3y + 43z + 44u - 16v = 2050.4
17x - 15y - 26z - 16u + 129v = 1307.8


Hint. Divide the coefficients on the left by 10 and the numbers on the right by 1000.
Multiply the answers obtained by 100.

Ans. x = 12.761, y = 9.735, z = 36.438, u = 20.907, v = 19.525.

12. The accompanying table gives the results of measurements at various dates on the
velocity of light in vacuum, together with an estimate by R. T. Birge of the probable error.
The velocities are the excess in kilometers per second over 299,000 km/sec.


Date Velocity Probable Error 

1874 990 200 

1879 910 50 

1882 860 30 

1882 853 60 

1902 901 84 

1906 784 10 

1923 782 30 

1926 798 15 

1928 786 10 

1932 774 4 

1936 771 10 

1937 771 10 

1940 776 6 


Find the weighted average and its probable error (the probable error is assumed to be 0.6745
times the standard error).

Ans. 299,777.8 ± 2.6 km/sec.

13. The four angles of a plane quadrilateral are measured as

A = 101°13'22", weight 3
B = 93°49'17", weight 2
C = 87° 5'39", weight 2
D = 77°52'40", weight 1

Adjust these results, and find the standard errors of the adjusted values.

Ans. The seconds of arc are A = 14" ± 14", B = 5" ± 17", C = 27" ± 17",
D = 15" ± 20".

14. In a plane quadrilateral ABCD the following angles were measured. All the
measurements may be assumed of equal weight. Adjust these observations.

CAB = 64° 8'34", DAC = 41°58'47"

ABC = 66°34' 9", ACD = 53°53'50"

BCA = 49°17'23", DAB = 106° 7'30"

CD A = 84“ 7a8", BCD = 103° 11' 3" 

Hint. Denote the true values of the first six angles by $\theta_1, \theta_2, \ldots, \theta_6$. These are subject to
two independent condition equations. Use the second of arc as a unit.

15. Solve the equations (10.76) and compute the standard errors of $x_1$, $x_2$, $x_3$. Also
obtain the same results by solving the system (10.79).

Ans. The standard error of $x_1$ is equal to

$\sigma\,(w_2 w_3 + w_2 w_4 + w_3 w_4)^{1/2} / (w_1 w_2 w_3 + w_1 w_2 w_4 + w_1 w_3 w_4 + w_2 w_3 w_4)^{1/2}$

and the others are symmetrical expressions.
CHAPTER XI


CURVILINEAR REGRESSION; MULTIPLE AND PARTIAL CORRELATION


11.1 The Correlation Ratio. In Part I of this work* the calculation and 
use of the correlation ratio are described. We recall that the correlation ratio 
is a measure of relationship appropriate to a bivariate distribution grouped in 
arrays. For convenience we consider x-arrays (columns in the usual table),
but with minor changes in notation the results apply equally well to y-arrays 
(rows). 

Let $N_i$ be the number of observations in the ith array (i = 1, 2, ..., p), and
let $N = \sum N_i$. Let $\bar{y}_i$ be the mean of observations in the ith array and $\bar{y}$ the
general mean. Then the sample correlation ratio $E_{yx}$ is defined by

(11.1)  $E_{yx}^2 = \sum N_i(\bar{y}_i - \bar{y})^2 / S(y - \bar{y})^2$

the denominator being summed over all the N observations in the sample.
A similar expression holds for $E_{xy}$, which in general has a different value
from $E_{yx}$.

It may be noted that if a straight line is fitted by least squares to the means
of arrays so that the weighted sum of squares of residuals is a minimum (the
weights being equal to the array frequencies) this line is the ordinary regression
line of y on x:

(11.2)  $Y - \bar{y} = b(x - \bar{x})$

where $b = r s_y/s_x$. Also

(11.3)  $S(y - \bar{y})^2 = S(y - \bar{y}_i + \bar{y}_i - \bar{y})^2 = S(y - \bar{y}_i)^2 + \sum N_i(\bar{y}_i - \bar{y})^2$

since the $N_i$ observations in one column have a common value of $\bar{y}_i - \bar{y}$.
Hence from (11.1)

$N s_y^2 = S(y - \bar{y}_i)^2 + N s_y^2 E_{yx}^2$

or

(11.4)  $E_{yx}^2 = 1 - S(y - \bar{y}_i)^2/(N s_y^2)$

This may be compared with the formula

(11.5)  $r^2 = 1 - S(y - Y)^2/(N s_y^2)$

and shows that $1 - E_{yx}^2$ is the proportion of the total sum of squares due to
fluctuation about the line of column means, just as $1 - r^2$ is the proportion
due to fluctuation about the straight regression line.

* Kenney, J. F. and Keeping, E. S., Mathematics of Statistics, Part One, D. Van Nostrand
Co., Inc., 1954.

The usual notation for the correlation ratio $\eta_{yx}$ is here reserved for the
population value. If $\eta$ and E are taken to refer to either pair $\eta_{yx}$ and $E_{yx}$ or $\eta_{xy}$ and
$E_{xy}$, the distribution of $E^2$ when $\eta = 0$ has been worked out by Hotelling.
If the various y-arrays (or x-arrays as the case may be) are normally
distributed with a common variance, then $E^2$ is a $\beta(n_1/2, n_2/2)$ variate, where
$n_1 = p - 1$ and $n_2 = N - p$. It follows readily that $n_2 E^2/[n_1(1 - E^2)]$ has
the F-distribution with $n_1$ and $n_2$ degrees of freedom.

The significance of an observed $E^2$ may, therefore, be tested by means of the
table of F, or K. Pearson's Tables of the Incomplete Beta Function. A special
table which may be used when N is large (as it usually is for a sample for
which we would calculate $E^2$) was prepared by Woo.

If $\eta$ is not zero, but the number of observations in each array is the same
for all samples, the frequency function for $E^2$ is

(11.6)  $f(E^2) = e^{-\lambda}(E^2)^{a-1}(1 - E^2)^{b-1} H(\lambda E^2)/B(a, b)$

where

(11.7)  $\lambda = N\eta^2/[2(1 - \eta^2)]$,  $a = n_1/2$,  $b = n_2/2$

and

(11.8)  $H(x) = 1 + \dfrac{a + b}{1!\,a}\,x + \dfrac{(a + b)(a + b + 1)}{2!\,a(a + 1)}\,x^2 + \cdots$

which is the confluent hypergeometric function.

The function given by (11.6) obviously reduces to the frequency function
for a $\beta(a, b)$ variate when $\lambda = 0$. It is, when $\lambda$ is not zero, an example of a
non-central distribution. Since this distribution is of some importance and
arises in other problems (see § 12.18), the function has been tabulated by
Tang. The tables give the values of $\int_0^{E_\alpha^2} f(E^2)\,dE^2$, where $E_\alpha^2$ is
determined by

$\int_{E_\alpha^2}^{1} f(E^2 \mid \lambda = 0)\,dE^2 = \alpha$

$\alpha$ being chosen as either 0.01 or 0.05.

11.2 A Test for Linearity of Regression. The weighted sum of squares
between column means which occurs in (11.1) can be split up into a part
depending on linear regression and a part depending on the deviation from linear
regression.

Since $\bar{y}_i - \bar{y} = \bar{y}_i - Y_i + Y_i - \bar{y}$, we have

(11.9)  $\sum N_i(\bar{y}_i - \bar{y})^2 = \sum N_i(\bar{y}_i - Y_i)^2 + \sum N_i(Y_i - \bar{y})^2$

(It is easily proved that the cross-product term vanishes by putting
$Y_i = a + bx_i$ and using the normal equations for a and b which are obtained
in fitting this line to the column means by least squares.) Equation (11.9)
can be written

(11.10)  $B = B_1 + B_2$

where by (11.1) $B = N s_y^2 E_{yx}^2$. Also as in Chapter VIII, $B_2 = N s_y^2 r^2$, so that

(11.11)  $B_1 = N s_y^2(E_{yx}^2 - r^2)$

This expression represents the part of the sum of squares between column 
means which cannot be accounted for by linear regression. If this part is 
excessive, compared with the random sampling fluctuations that might be 
expected under the null hypothesis that the true regression is linear, we reject 
the null hypothesis. The basis for comparison is the variation within arrays, 
$W = S(y - \bar{y}_i)^2$, which by (11.4) is equal to $N s_y^2(1 - E_{yx}^2)$. Since for each
array the sum of squares of $y - \bar{y}_i$ is distributed as $\chi^2\sigma^2$ with $N_i - 1$ df,
W itself is distributed as $\chi^2\sigma^2$ with N - p df.

Again, since the variance of $\bar{y}_i$ is $\sigma^2/N_i$, $\sum N_i(\bar{y}_i - \bar{y})^2$ is distributed as $\chi^2\sigma^2$
with p - 1 df. Also $B_2 = \sum N_i b^2(x_i - \bar{x})^2$, and so is independent of
regression except for $b^2$. Since, as shown in § 8.7, b is a normal variate with variance

$\sigma_y^2(1 - \rho^2)/(N s_x^2) = \sigma_y^2(1 - \rho^2)/\sum N_i(x_i - \bar{x})^2$

it follows that on the assumption of an uncorrelated parent distribution,
$B_2$ is the square of a normal variate with variance $\sigma^2$ and therefore is
distributed as $\chi^2\sigma^2$ with 1 df. Hence $B_1$ is distributed as $\chi^2\sigma^2$ with p - 2 df,
independently of W, so that

(11.12)  $F = \dfrac{B_1}{W}\cdot\dfrac{N - p}{p - 2} = \dfrac{E_{yx}^2 - r^2}{1 - E_{yx}^2}\cdot\dfrac{N - p}{p - 2}$

has the F-distribution with p - 2 and N - p df. A significant value of F
indicates a significant departure from linearity. A similar test is, of course,
available for $E_{xy}$.

The situation may be clarified by an Analysis of Variance Table. Thus 


Variation                      SS                            df       MS

About regression line ($B_1$)    $N s_y^2(E_{yx}^2 - r^2)$       p - 2    $N s_y^2(E_{yx}^2 - r^2)/(p - 2)$

Due to regression ($B_2$)        $N s_y^2 r^2$                  N - p    $N s_y^2 r^2/(N - p)$

Total (B)                      $N s_y^2 E_{yx}^2$             N - 2



Example 1. In an investigation of the relationship between percentage illiteracy (y) and
percentage Negro population (x) for 82 counties in the State of Mississippi (1920), it was
found that the regression line of y on x was Y = 0.299x + 2.02, but the line of column means
appeared appreciably curved. The values of x were grouped in ten classes, "under 10,"
"10 and under 20," etc. Calculation gave $r^2$ = 0.7134, $E_{yx}^2$ = 0.7803, with p = 10,
N = 82. Hence F = 2.74. The 5% and 1% points for $n_1$ = 8 and $n_2$ = 72 are about
2.07 and 2.76 respectively so that the true trend is rather definitely curved.
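
The value of F in Example 1 follows directly from (11.12). A minimal Python sketch, in which the function name `linearity_f` is an illustrative assumption rather than anything from the text:

```python
def linearity_f(e_sq, r_sq, n_obs, n_arrays):
    """F ratio of (11.12), returned with its p - 2 and N - p degrees of freedom."""
    df_between = n_arrays - 2      # p - 2
    df_within = n_obs - n_arrays   # N - p
    f_ratio = (e_sq - r_sq) / (1.0 - e_sq) * df_within / df_between
    return f_ratio, df_between, df_within

# Values quoted in Example 1: E^2 = 0.7803, r^2 = 0.7134, N = 82, p = 10.
f_ratio, df1, df2 = linearity_f(0.7803, 0.7134, 82, 10)
print(round(f_ratio, 2), df1, df2)   # 2.74 8 72
```

The ratio still has to be referred to a table of F with 8 and 72 degrees of freedom, as in the example.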
11.3 Fitting a Polynomial of Second or Higher Degree. The least squares
method of § 10.2 for obtaining a linear predicting equation $Y = \sum b_i x_i$ makes
no assumption as to the independence of the $x_i$. We can, therefore, take
$x_1 = 1$, $x_2 = x$, $x_3 = x^2$, ..., $x_p = x^{p-1}$, and use this method for fitting a
polynomial of degree p - 1 to a set of observations of pairs of values $x_\alpha$, $y_\alpha$,
the assumption being that the $x_\alpha$ are either fixed values or variates with a
negligible error compared with that of y. It is often possible to choose the
$x_\alpha$ so that they are equally spaced on the x-axis and thus simplify the
computations.

The $a_{jk}$ of (10.20) are now given by $S x_{j\alpha} x_{k\alpha} = S x^{j+k-2}$, so that the matrix A is

$$\begin{bmatrix}
N & Sx & \cdots & Sx^{p-1} \\
Sx & Sx^2 & \cdots & Sx^{p} \\
\vdots & & & \vdots \\
Sx^{p-1} & Sx^{p} & \cdots & Sx^{2p-2}
\end{bmatrix}$$

and $g_j$ is $S(x^{j-1}y)$. Then $b_j$ is the jth element of $A^{-1}g$, and the variance of $b_j$ is
$\sigma^2$ times the jth diagonal element of $A^{-1}$.

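For example, to fit the quadratic

$Y = b_1 + b_2 x + b_3 x^2$

we have the equations

(11.13)  $\begin{bmatrix} N & Sx & Sx^2 \\ Sx & Sx^2 & Sx^3 \\ Sx^2 & Sx^3 & Sx^4 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} Sy \\ Sxy \\ Sx^2y \end{bmatrix}$

to be solved for $b_1$, $b_2$, $b_3$.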

If we choose the unit of x, so that the values (supposed equally spaced)
change by 1 from one observation to the next, and if we choose the origin of x
midway in the range (at the middle value if N is odd and half way between
the two middle values if N is even), then in the simplest case of one observation
of y at each value of x, we have

$Sx = Sx^3 = 0$

$Sx^2 = N(N^2 - 1)/12$

$Sx^4 = N(N^2 - 1)(3N^2 - 7)/240$

The equations (11.13) then reduce to

(11.14)
$N b_1 + \{N(N^2 - 1)/12\}\,b_3 = Sy$
$N(N^2 - 1)\,b_2/12 = Sxy$
$N(N^2 - 1)\,b_1/12 + N(N^2 - 1)(3N^2 - 7)\,b_3/240 = Sx^2y$

Example 2. Given the following values of x and y,

x      5     15     25     35     45     55     65     75     85     95
y   10.0    8.1    9.3   12.1   13.6   17.5   20.0   24.0   30.0   42.5
u   -4.5   -3.5   -2.5   -1.5   -0.5    0.5    1.5    2.5    3.5    4.5

calculate (a) a straight and (b) a parabolic trend line, and find the effect of each on the sum
of squares of y.

In terms of a new variable u = (x - 50)/10 the equations for the straight trend line are

$N b_1 = Sy = 187.1$

$N(N^2 - 1)\,b_2/12 = Suy = 273.45$

giving $b_1$ = 18.71, $b_2$ = 3.3145. Hence Y = 18.71 + 0.33145(x - 50) = 0.3314x + 2.138.
The equations for a parabolic trend line are

$10 b_1' + 82.5 b_3' = 187.1$
$82.5 b_2' = 273.45$
$82.5 b_1' + 1208.6 b_3' = 1817.98 \;(= Su^2y)$

These give $b_2'$ = 3.3145, $b_1'$ = 14.422, $b_3'$ = 0.5197, so that Y' = 14.4225 + 0.33145
(x - 50) + 0.005197(x - 50)^2 = 10.842 - 0.1882x + 0.005197x^2.

The calculated values of Y and Y' are as shown:

y    10.0    8.1    9.3    12.1    13.6    17.5    20.0    24.0    30.0    42.5

Y    3.794  7.109  10.423  13.738  17.053  20.367  23.682  26.996  30.311  33.626

Y'  10.030  9.188   9.384  10.620  12.895  16.210  20.564  25.957  32.390  39.862

The analysis of variance table is given in the next section (omitting the part dealing with 
cubic regression). Since the mean square for parabolic regression is 45 times the mean 
square for deviations from parabolic regression, the parabolic term is highly significant. 
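
The reduced equations (11.14) make the fit of Example 2 a few lines of arithmetic. The following Python sketch reproduces the coefficients under the assumptions of that example (ten equally spaced observations, u = (x - 50)/10); the variable names are illustrative only.

```python
x = list(range(5, 100, 10))
y = [10.0, 8.1, 9.3, 12.1, 13.6, 17.5, 20.0, 24.0, 30.0, 42.5]
u = [(v - 50) / 10 for v in x]          # centred, unit-spaced variable

N = len(y)
s2 = N * (N**2 - 1) / 12                         # S u^2
s4 = N * (N**2 - 1) * (3 * N**2 - 7) / 240       # S u^4

Sy = sum(y)
Suy = sum(a * b for a, b in zip(u, y))
Su2y = sum(a * a * b for a, b in zip(u, y))

# Middle equation of (11.14) gives the linear coefficient directly.
b2 = Suy / s2
# First and third equations form a 2 x 2 system for b1 and b3 (Cramer's rule).
det = N * s4 - s2 * s2
b1 = (Sy * s4 - s2 * Su2y) / det
b3 = (N * Su2y - s2 * Sy) / det

print(round(b1, 4), round(b2, 4), round(b3, 4))
```

Because Su and Su^3 vanish, the linear coefficient is unchanged when the quadratic term is added, which is the behaviour the orthogonal polynomials of the next section extend to every degree.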

11.4 Orthogonal Polynomials. The procedure of § 11.3 has the disadvan- 
tage that if it is required to introduce an additional term into the regression 
equation all the coefficients have to be calculated afresh. A method suggested 
by R. A. Fisher involves the fitting of a series of orthogonal polynomials, each
term being independent of all the others. This means that each regression 
coefficient can be calculated independently, and the tests of significance are 
facilitated. 

Two polynomials $P_1(x)$ and $P_2(x)$ are orthogonal if $S(P_1P_2) = 0$ where S is
the sum over a specified set of values of x. If x were a continuous variable in
the range from a to b, the condition of orthogonality would be $\int_a^b P_1P_2\,dx = 0$.

It is easily verified, as in the following table, that the polynomials $P_0 = 1$,
$P_1 = x - 4$, $P_2 = x^2 - 8x + 12$, $P_3 = x^3 - 12x^2 + 41x - 36$ are all orthogonal
for the set of integral values of x from 1 to 7.

It can be proved that any polynomial in x of degree k, for a specified set of
values of x, can be expressed as a linear function of k + 1 orthogonal
polynomials, say

(11.15)  $Y = A_0\xi_0 + A_1\xi_1 + \cdots + A_k\xi_k$

where $\xi_0 = 1$, and $\xi_i$ is a polynomial of degree i (i = 1, 2, ..., k). If x takes
the values 1, 2, 3, ..., N, the first four of these polynomials are

$\xi_1 = \lambda_1(x - \bar{x})$

$\xi_2 = \lambda_2[(x - \bar{x})^2 - (N^2 - 1)/12]$
$\xi_3 = \lambda_3[(x - \bar{x})^3 - (x - \bar{x})(3N^2 - 7)/20]$

$\xi_4 = \lambda_4[(x - \bar{x})^4 - (x - \bar{x})^2(3N^2 - 13)/14 + 3(N^2 - 1)(N^2 - 9)/560]$

where $\bar{x}$ is the mean value of x and the $\lambda$'s are usually and conveniently chosen
so as to make the values of these polynomials integers (as small as possible)
for all values of x from 1 to N. Thus if N = 7, we have $\bar{x} = 4$, $\lambda_1 = 1$,
$\lambda_2 = 1$, $\lambda_3 = 1/6$, $\lambda_4 = 7/12$.

The sets of values of these polynomials are

x    $\xi_1$    $\xi_2$    $\xi_3$    $\xi_4$
1     -3     5    -1     3
2     -2     0     1    -7
3     -1    -3     1     1
4      0    -4     0     6
5      1    -3    -1     1
6      2     0    -1    -7
7      3     5     1     3


All the polynomials with even subscripts have a set of values symmetric 
about the middle. All those with odd subscripts are skew-symmetric (the 
signs changing but not the magnitudes). Hence only one half the table, 
together with the middle line if any, need be given. In the tables the lower 
halves only are printed. 
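
The tabulated values for N = 7 can be regenerated from the four formulas, and their mutual orthogonality confirmed, with the Python sketch below. It is an illustration only: the function name `xi` and the list `lam` are hypothetical, and exact fractions are used so that the zero sums are exact.

```python
from fractions import Fraction
from itertools import combinations

N = 7
xbar = Fraction(N + 1, 2)
# lam[r] is the multiplier of xi_r for N = 7; lam[0] = 1 is a placeholder for xi_0 = 1.
lam = [Fraction(1), Fraction(1), Fraction(1), Fraction(1, 6), Fraction(7, 12)]

def xi(r, x):
    """Value of the r-th orthogonal polynomial (r = 0..4) at the integer x."""
    d = x - xbar
    if r == 0:
        return Fraction(1)
    if r == 1:
        return lam[1] * d
    if r == 2:
        return lam[2] * (d**2 - Fraction(N**2 - 1, 12))
    if r == 3:
        return lam[3] * (d**3 - d * Fraction(3 * N**2 - 7, 20))
    return lam[4] * (d**4 - d**2 * Fraction(3 * N**2 - 13, 14)
                     + Fraction(3 * (N**2 - 1) * (N**2 - 9), 560))

table = [[xi(r, x) for x in range(1, N + 1)] for r in range(5)]

# Every pair of distinct polynomials should have a zero sum of products.
orthogonal = all(sum(p * q for p, q in zip(table[i], table[j])) == 0
                 for i, j in combinations(range(5), 2))

for r, row in enumerate(table):
    print(r, [int(v) for v in row])
print(orthogonal)
```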

The regression coefficients $A_0, A_1, \ldots, A_k$ are calculated by least squares so
as to make $S(y - Y)^2$ a minimum. The normal equations are

$A_0 S(\xi_0\xi_0) + A_1 S(\xi_0\xi_1) + \cdots + A_k S(\xi_0\xi_k) = S(y\xi_0)$

$A_0 S(\xi_0\xi_1) + A_1 S(\xi_1\xi_1) + \cdots + A_k S(\xi_1\xi_k) = S(y\xi_1)$
etc.

but because of the orthogonal property of these polynomials, and because
$\xi_0 = 1$, these equations reduce to

(11.16)
$A_0 N = S(y)$
$A_1 S(\xi_1^2) = S(y\xi_1)$
$\cdots$
$A_k S(\xi_k^2) = S(y\xi_k)$

It is evident, therefore, that each regression coefficient can be calculated
independently of all the others.

The sum of squares of deviations from regression is given by

(11.17)  $S(y^2) - A_0 S(y) - A_1 S(y\xi_1) - \cdots - A_k S(y\xi_k)$

The first two terms give the total sum of squares about the mean, since $A_0 = \bar{y}$.
The 3rd term gives the reduction due to linear regression, the 4th term the
additional reduction due to parabolic regression, and so on.

The actual fitting of the polynomials is greatly facilitated by tables giving
the values assumed by the $\xi$'s for all necessary x. For values of N up to 75
these tables are given in Fisher and Yates' Statistical Tables (3rd Edition).
They have been extended to N = 104 by Anderson and Houseman. New
tables by DeLury give the polynomials up to $\xi_{N-1}$ for N < 26, and also
the integrals of these polynomials, $I_r$ and $J_r$ (integrals of $\xi_r(x)\,dx$ taken from
the lower limits 0 and -1/2 respectively),
for r < 14, N < 26. These integrals are useful in estimating the total
amount of some variable X from a systematic sample, such as is often obtained
in forestry, ore-drilling, etc., when samples are selected at regularly spaced
points, along a preassigned straight line or series of such lines.

The use of orthogonal polynomials may be illustrated by fitting a cubic to
the data of Example 2, in which N = 10. The last three columns are read
from the tables, and u = (x + 5)/10.

 x     u      y     $\xi_1$    $\xi_2$    $\xi_3$
 5     1    10.0    -9    +6    -42
15     2     8.1    -7    +2    +14
25     3     9.3    -5    -1    +35
35     4    12.1    -3    -3    +31
45     5    13.6    -1    -4    +12
55     6    17.5    +1    -4    -12
65     7    20.0    +3    -3    -31
75     8    24.0    +5    -1    -35
85     9    30.0    +7    +2    -14
95    10    42.5    +9    +6    +42


We calculate $S(y)$ = 187.1, $S(y\xi_1)$ = 546.9, $S(y\xi_2)$ = 137.2, $S(y\xi_3)$ = 252.2.
The values of $S(\xi_1^2)$ = 330, $S(\xi_2^2)$ = 132 and $S(\xi_3^2)$ = 8580 are read from the
tables. Then $A_0$ = 187.1/10 = 18.71, $A_1$ = 546.9/330 = 1.6573, $A_2$ =
137.2/132 = 1.0394, $A_3$ = 252.2/8580 = 0.029394.

The total sum of squares is 1071.33. The reduction due to linear regression
is $A_1 S(y\xi_1)$ = 906.38, the additional reduction due to parabolic regression is
$A_2 S(y\xi_2)$ = 142.61, and the further reduction due to cubic regression is
$A_3 S(y\xi_3)$ = 7.41. The analysis of variance table is, therefore,

Variation                 SS        df     MS

Total                   1071.33     9

Linear regression        906.38     1     906.38
  Deviations             164.95     8      20.62

Parabolic regression     142.61     1     142.61
  Deviations              22.34     7       3.19

Cubic regression           7.41     1       7.41
  Deviations              14.93     6       2.49


The cubic regression term is not significant, so that a parabolic trend line 
would probably be quite satisfactory. The parabolic regression term is 
highly significant, indicating a well-marked deviation from linearity. There 
is, of course, no guarantee that because one term is non-significant all higher 
terms are also non-significant, but one can often form an opinion from the 
relation of the plotted curve to the scatter diagram of the original data. 
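
The independence of the coefficients in (11.16) means each one is a single ratio of sums. The Python sketch below recomputes the coefficients and the successive reductions for Example 2 from the tabulated $\xi$ values; the names used are illustrative assumptions, and small differences in the last figure from the printed values reflect rounding of the coefficients in the hand calculation.

```python
y = [10.0, 8.1, 9.3, 12.1, 13.6, 17.5, 20.0, 24.0, 30.0, 42.5]
xi_tables = [
    [-9, -7, -5, -3, -1, 1, 3, 5, 7, 9],              # xi_1
    [6, 2, -1, -3, -4, -4, -3, -1, 2, 6],             # xi_2
    [-42, 14, 35, 31, 12, -12, -31, -35, -14, 42],    # xi_3
]

def fit_term(xi):
    """Return (A_r, reduction) with A_r = S(y xi)/S(xi^2) and reduction = A_r * S(y xi)."""
    s_y_xi = sum(a * b for a, b in zip(y, xi))
    s_xi_sq = sum(v * v for v in xi)
    coef = s_y_xi / s_xi_sq
    return coef, coef * s_y_xi

n = len(y)
mean_y = sum(y) / n                                   # A_0
total_ss = sum(v * v for v in y) - mean_y * sum(y)    # first two terms of (11.17)

terms = [fit_term(xi) for xi in xi_tables]
coefs = [t[0] for t in terms]
reductions = [t[1] for t in terms]

# Sum of squares and mean square of deviations after each successive term.
remaining = total_ss
for degree, red in enumerate(reductions, start=1):
    remaining -= red
    df = n - 1 - degree
    print(degree, round(red, 2), round(remaining, 2), df, round(remaining / df, 2))
```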

In this particular example the addition of a fourth degree term reduces the
SS for deviations to 1.39 and the mean square to 0.28, while a fifth degree
term reduces the SS to 1.16 and raises the mean square to 0.29. The mean
squares for quartic and quintic regression are 13.54 and 0.23 respectively, so
that, based on the deviations from quintic regression, the third and fourth
degree terms, but not the fifth degree term, appear highly significant.

If the second and fourth degree curves are plotted on a graph showing the 
original data, the parabola clearly gives a good general description of the 
course of the data, but the fit of the quartic is much closer. With a ninth 
degree curve we could fit the observed data exactly, but such a complicated
curve is obviously not desirable. One must compromise between the desire 
for simplicity and the desire to get a good fit, and the parabola would appear 
in this example to give a satisfactory compromise. 

The equation of the parabola is

$Y = A_0 + A_1\xi_1 + A_2\xi_2$

$= A_0 + A_1\lambda_1(u - \bar{u}) + A_2\lambda_2[(u - \bar{u})^2 - (N^2 - 1)/12]$

The values of $\lambda_1$ and $\lambda_2$, from the tables, are 2 and 1/2 respectively, and
$(N^2 - 1)/12 = 33/4$, so that

$Y = 18.71 + 3.3146(u - 5.5) + 0.5197[(u - 5.5)^2 - 8.25]$
$= 11.913 - 2.4021u + 0.5197u^2$
$= 11.913 - 0.2402(x + 5) + 0.005197(x + 5)^2$

$= 10.842 - 0.1882x + 0.005197x^2$

which agrees with the equation previously obtained. In calculating values
of Y for plotting a curve, it is simpler to work with the values of the $\xi$'s rather
than x.

11.5 Seidel's Method of Successive Approximations. Sometimes
approximate values of the constants in a regression curve may be obtained from a
graph. These values can then be improved by least squares.

Let the true curve be $Y = f(x, \alpha, \beta)$, where for convenience we suppose only
two parameters, and let a and b be approximate values of $\alpha$ and $\beta$ respectively.
Then, if $\delta a = \alpha - a$, $\delta b = \beta - b$, both $\delta a$ and $\delta b$ may be presumed small,
and, to a first approximation, squares, products, and higher powers can be
neglected. Hence

(11.18)  $Y = f(x, a + \delta a, b + \delta b)$

$= f(x, a, b) + \delta a\,f_a + \delta b\,f_b$

approximately, where $f_a$ and $f_b$ are the partial derivatives of f with respect to
a and b. Replacing Y by the observed y, we have a set of observation
equations for the unknowns $\delta a$ and $\delta b$. By forming the normal equations and
solving them, $\delta a$ and $\delta b$ are obtained, and these values, added to a and b,
give improved values of the parameters. The process can then be repeated
as often as necessary. It usually converges quite rapidly, and two or three
stages will generally give a satisfactory result.

Example 3. The following data were obtained in a physical experiment. (E represents
the energy radiated from a carbon filament lamp per cm² per sec and T the absolute
temperature of the filament in thousands of degrees C.)

T   1.309   1.471   1.490   1.565   1.611   1.680

E   2.138   3.421   3.597   4.340   4.882   5.660

By plotting on logarithmic graph paper it is seen that the data follow a law of the type

$E = aT^b$

with a = 0.725, and b = 3.96 approximately. Here $\partial E/\partial a = T^b$, $\partial E/\partial b = aT^b \log T$, so
that $E - aT^b = T^b\,\delta a + aT^b \log T\,\delta b$. Using the approximate values of a and b, we
obtain a set of six equations for $\delta a$ and $\delta b$. On forming the normal equations and solving
them, we get $\delta a$ = 0.0434, $\delta b$ = -0.102, so that the new values are a = 0.7684, b = 3.858.

If we repeat the process with these new values, we get $\delta a$ = 0.0004, $\delta b$ = 0.0024, so that
a = 0.7688, b = 3.8604. From the sum of squares of residuals, the standard errors of a and
b are 0.009 and 0.033 respectively, so that it is quite sufficiently accurate to take a = 0.769,
b = 3.86.
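
The iteration of Example 3 can be sketched in Python as below. This is an illustrative implementation under stated assumptions, not the author's worksheet: natural logarithms are assumed for log T, the function name `seidel_step` is hypothetical, and the two-parameter normal equations are solved by Cramer's rule.

```python
import math

T = [1.309, 1.471, 1.490, 1.565, 1.611, 1.680]
E = [2.138, 3.421, 3.597, 4.340, 4.882, 5.660]

def seidel_step(a, b):
    """One linearised least-squares correction (delta_a, delta_b) for E = a * T**b."""
    fa = [t ** b for t in T]                        # partial derivative dE/da
    fb = [a * t ** b * math.log(t) for t in T]      # partial derivative dE/db
    r = [e - a * t ** b for e, t in zip(E, T)]      # residuals at the current (a, b)
    saa = sum(p * p for p in fa)
    sab = sum(p * q for p, q in zip(fa, fb))
    sbb = sum(q * q for q in fb)
    sar = sum(p * v for p, v in zip(fa, r))
    sbr = sum(q * v for q, v in zip(fb, r))
    det = saa * sbb - sab * sab
    return (sar * sbb - sab * sbr) / det, (saa * sbr - sab * sar) / det

a, b = 0.725, 3.96                  # starting values read from the graph
first = seidel_step(a, b)           # corrections of the first stage
for _ in range(5):                  # a few stages are ample here
    da, db = seidel_step(a, b)
    a, b = a + da, b + db

print([round(v, 4) for v in first], round(a, 4), round(b, 4))
```

Each stage re-linearises about the latest values, so the corrections shrink rapidly once the starting point is reasonable.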


If /(x, a, 6) is a linear function of a and 6, the Seidel method is exact, since 
the second and higher derivatives of / all vanish. The method may thus be 
used in fitting straight lines or polynomial curves, and has the advantage that 
it is not necessary to carry many decimals. 

In the example of § 11.4, suppose that the approximate parabola is

$Y = 12 - 2.5u + 0.5u^2$

Putting $\alpha = 12 + \delta a$, $\beta = -2.5 + \delta b$, $\gamma = 0.5 + \delta c$, and forming the normal
equations, we obtain

$10\,\delta a + 55\,\delta b + 385\,\delta c = 12.1$
$55\,\delta a + 385\,\delta b + 3025\,\delta c = 92.5$
$385\,\delta a + 3025\,\delta b + 25333\,\delta c = 761.7$

The solution is $\delta c$ = 0.020, $\delta b$ = 0.098, $\delta a$ = -0.086, so that the corrected
parabola is

$Y = 11.914 - 2.402u + 0.520u^2$

11.6 Exponential and Modified Exponential Regression. The exponential
curve

(11.19)  $Y = be^{px}$

is of fairly frequent occurrence, implying that the observed variable y increases
or decreases at an approximately uniform percentage rate as x increases. The
curve

(11.20)  $Y = a + be^{px}$

has been termed a modified exponential curve, and arises also in a number
of problems.

The customary procedure in fitting (11.19) is to write it in the form

(11.21)  $\log Y = \log b + px$

and fit a straight line to the observed values of log y as plotted against x. This 
means that the sum of squares of deviations for log y is minimized, instead of 
the corresponding quantity for y. Since d(}ogy) = dy/y, the procedure is 
strictly correct in those problems in which the standard deviation of y in- 
creases in direct proportion to y. For problems in which the variance of y is 
independent of y, the effect of the customary procedure is to give undue weight 
to the smaller values of y. For many data in the field of economics the 
assumption of a standard deviation proportional to y seems reasonable, but 
there are problems in other fields where an exponential trend accompanied 
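by a constant standard deviation may fairly be assumed.

The exact least squares solution is laborious. It requires us to calculate b
and p from the equations

(11.22)
$S(y_\alpha e^{px_\alpha}) = b\,S e^{2px_\alpha}$
$S(x_\alpha y_\alpha e^{px_\alpha}) = b\,S(x_\alpha e^{2px_\alpha})$

or, for the modified equation (11.20), a, b and p from

(11.23)
$S y_\alpha = Na + b\,S e^{px_\alpha}$
$S(y_\alpha e^{px_\alpha}) = a\,S e^{px_\alpha} + b\,S e^{2px_\alpha}$
$S(x_\alpha y_\alpha e^{px_\alpha}) = a\,S(x_\alpha e^{px_\alpha}) + b\,S(x_\alpha e^{2px_\alpha})$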

Rough approximations to b and p may be found for (11.19) by fitting a
straight line graphically to the values of log y plotted against x; these
approximations may be improved by Seidel's method.

Tables for use in fitting exponential curves may be found in Glover's Tables,
but these cover a very limited range of values of p (from 0 to 0.0953, $e^p$ being
between 1.0 and 1.1). Cowden has given a method of finding approximate
values of a, b, and q $(= e^p)$. This consists in plotting the data, drawing a
tentative trend line, selecting three equidistant ordinates, and calculating a
from the formula

(11.24)  $a = \dfrac{Y_0 Y_2 - Y_1^2}{Y_0 + Y_2 - 2Y_1}$

(This formula is readily proved by putting x - h, x and x + h for x in the
right-hand side of (11.20) and equating the results to $Y_0$, $Y_1$, $Y_2$ respectively.)
Values of Y - a are now plotted on semi-logarithmic paper, and a is further
adjusted if necessary so that a straight line fits the points reasonably well.
Then b is the ordinate of this straight line at x = 0 and q is the ratio of
the ordinates at $x_{i+1}$ and $x_i$. These values of a, b, and q may be improved by
Seidel's process.
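
The three-ordinate formula (11.24) is exact when the ordinates lie on a modified exponential curve, which the following Python sketch illustrates. The curve parameters and spacing are hypothetical values chosen only for the check, and the function name `asymptote_from_three` is an assumption.

```python
import math

def asymptote_from_three(y0, y1, y2):
    """Estimate a in Y = a + b*exp(p*x) from three equidistant ordinates, as in (11.24)."""
    return (y0 * y2 - y1 * y1) / (y0 + y2 - 2.0 * y1)

# Hypothetical curve used only to check the formula.
a_true, b_true, p_true = 2.0, 3.0, 0.5
x_mid, h = 1.0, 1.0
ordinates = [a_true + b_true * math.exp(p_true * x) for x in (x_mid - h, x_mid, x_mid + h)]

a_est = asymptote_from_three(*ordinates)
print(a_est)
```

With ordinates read from a freehand trend line the result is only a starting value, and the denominator becomes small (and the estimate unstable) when the three points are nearly collinear.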

A method of fitting the exponential curve (11.19), which in practice gives
satisfactory results, is to calculate the straight line regression of log y on x
with the observations weighted in proportion to y. This procedure
approximately counteracts the automatic weighting (nearly proportional to 1/y)
which results from using log y instead of y. If the fitted curve has the
equation

(11.25)  $\log Y = c + px$,  $c = \log b$

the weighted least squares condition is

(11.26)  $\sum y(\log y - c - px)^2 = \min$

This gives rise to the normal equations

$c\sum y + p\sum xy = \sum y \log y$
$c\sum xy + p\sum x^2 y = \sum xy \log y$

whence p and b $(= e^c)$ are determined.

The following example is given by Snedecor, x being the age in days of
chick embryos, and y the dry weight in grams.

 x      y      $\log_{10} y$
 6    0.029    -1.538
 7    0.052    -1.284
 8    0.079    -1.102
 9    0.125    -0.903
10    0.181    -0.742
11    0.261    -0.583
12    0.425    -0.372
13    0.738    -0.132
14    1.130     0.053
15    1.882     0.275
16    2.812     0.449

(Common logarithms have been used, so that the calculated c and p must be
multiplied by 2.303. Alternatively, the equation may be written in the form
$Y = b\,10^{px}$, with $c = \log_{10} b$.)

The normal equations have the solution (after multiplication) p = 0.4581,
c = -6.2794, so that the exponential equation is

Y = 0.001875 exp (0.4581x)

The equation obtained without weighting is

Y = 0.002046 exp (0.4511x)

while the exact least squares solution is

Y = 0.001895 exp (0.4573x)

It is clear that the weighted regression of log y on x gives in this example a
good approximation.
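
The weighted fit of the chick embryo data can be reproduced with the Python sketch below. It is a minimal illustration: natural logarithms are used directly (so no multiplication by 2.303 is needed), the function name `weighted_log_fit` is hypothetical, and the unweighted fit is obtained from the same routine by passing unit weights.

```python
import math

x = list(range(6, 17))
y = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261, 0.425, 0.738, 1.130, 1.882, 2.812]

def weighted_log_fit(x, y, weights):
    """Solve the normal equations of (11.26) for c and p with the given weights."""
    logs = [math.log(v) for v in y]
    sw = sum(weights)
    swx = sum(w * u for w, u in zip(weights, x))
    swxx = sum(w * u * u for w, u in zip(weights, x))
    swl = sum(w * l for w, l in zip(weights, logs))
    swxl = sum(w * u * l for w, u, l in zip(weights, x, logs))
    det = sw * swxx - swx * swx
    c = (swl * swxx - swx * swxl) / det
    p = (sw * swxl - swx * swl) / det
    return c, p

c_w, p_w = weighted_log_fit(x, y, y)                 # weights proportional to y
c_u, p_u = weighted_log_fit(x, y, [1.0] * len(y))    # customary unweighted fit

print(round(math.exp(c_w), 6), round(p_w, 4))
print(round(math.exp(c_u), 6), round(p_u, 4))
```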

The method of weighting could be applied also to the modified exponential
equation, the weighting being now proportional to y - a. Preliminary values
of a, b, and q having been obtained by Cowden's method, these values could
be improved by a weighted Seidel process. The least squares condition for
the weighted process is

$\sum (y - a)\{\log(y - a) - \delta a/(y - a) - xp - x\,\delta p - \log b - \delta b/b\}^2 = \min$

11.7 Fitting a Simple Harmonic Curve to a Series of Observations. Many
phenomena appear to be more or less cyclical, and in such cases it may seem
reasonable to fit a simple harmonic curve to a series of observations. Let
the curve be

(11.27)  $Y = A \cos \omega x + B \sin \omega x + C$

where the period is $2\pi/\omega$, and the series extends over an integral number of
periods. The normal equations are

(11.28)  $A\,S(\cos^2 \omega x) + B\,S(\cos \omega x \sin \omega x) + C\,S(\cos \omega x) = S(y \cos \omega x)$
         $A\,S(\cos \omega x \sin \omega x) + B\,S(\sin^2 \omega x) + C\,S(\sin \omega x) = S(y \sin \omega x)$
         $A\,S(\cos \omega x) + B\,S(\sin \omega x) + CN = S(y)$

On the assumption of an integral number of complete periods, $S(\cos \omega x) =
S(\sin \omega x) = 0$. Also $S(\cos^2 \omega x) = S(\sin^2 \omega x) = N/2$, since $S(1 - 2\sin^2 \omega x)
= S(2\cos^2 \omega x - 1) = S(\cos 2\omega x) = 0$, and $S(\sin \omega x \cos \omega x) = \tfrac{1}{2} S(\sin 2\omega x) = 0$.
Equations (11.28), therefore, reduce to

  $A = \frac{2}{N}\, S(y \cos \omega x)$

  $B = \frac{2}{N}\, S(y \sin \omega x)$

  $C = \bar{y}$
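
The reduced formulas can be checked numerically. The sketch below assumes observations at the equally spaced points x = 0, 1, ..., N - 1 and a trial wave making k complete cycles in the series; the function name `fit_harmonic` and the simulated series are hypothetical. It compares the closed-form A, B, C with a general least-squares solution of (11.28).

```python
import numpy as np

def fit_harmonic(y, k):
    """Fit Y = A cos(wx) + B sin(wx) + C at x = 0..N-1 with w = 2*pi*k/N."""
    y = np.asarray(y, dtype=float)
    N = y.size
    x = np.arange(N)
    w = 2.0 * np.pi * k / N
    A = 2.0 / N * np.sum(y * np.cos(w * x))
    B = 2.0 / N * np.sum(y * np.sin(w * x))
    C = y.mean()
    return A, B, C

# Hypothetical series: 24 observations covering two complete periods of 12.
rng = np.random.default_rng(0)
N, k = 24, 2
x = np.arange(N)
w = 2.0 * np.pi * k / N
y = 3.0 * np.cos(w * x) - 1.5 * np.sin(w * x) + 10.0 + rng.normal(0.0, 0.5, N)

A, B, C = fit_harmonic(y, k)

# Cross-check: solve the full least-squares problem directly.
X = np.column_stack([np.cos(w * x), np.sin(w * x), np.ones(N)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(A, B, C)
print(coef)
```

The two solutions agree because, for an integral number of periods, the three columns of the design matrix are mutually orthogonal.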

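The coefficient matrix of (11.28) is

(11.29)  $\begin{bmatrix} N/2 & 0 & 0 \\ 0 & N/2 & 0 \\ 0 & 0 & N \end{bmatrix}$

so that the variance-covariance matrix, which is the inverse of this (apart
from the factor $\sigma^2$), is

  $\begin{bmatrix} 2/N & 0 & 0 \\ 0 & 2/N & 0 \\ 0 & 0 & 1/N \end{bmatrix}$

The sum of squares of residuals is

(11.30)  $S(v^2) = S(y^2) - A\,S(y \cos \omega x) - B\,S(y \sin \omega x) - C\,S(y)$
         $= S(y^2) - \frac{2}{N}\,[S(y \cos \omega x)]^2 - \frac{2}{N}\,[S(y \sin \omega x)]^2 - N\bar{y}^2$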
The number of degrees of freedom is N - 3. Since (11.27) corresponds to a
wave of amplitude $(A^2 + B^2)^{1/2}$ or intensity $A^2 + B^2$, the test for reality of
the wave or harmonic is a test of whether $A^2 + B^2$ is significantly different
from zero. Since on the usual assumptions for regression A and B are normally
distributed, it follows from the variance-covariance matrix that they are
independent and have a common variance $2\sigma^2/N$. Hence $(A^2 + B^2)N/2\sigma^2$
has the $\chi^2$ distribution with 2 degrees of freedom. This provides a test for
the significance of the observed $A^2 + B^2$ provided that $\sigma^2$ can be estimated
with considerable accuracy, say from a long series of observations. If only
a comparatively short series is available we may estimate $\sigma^2$ by $S(v_\alpha^2)/(N - 3)$,
where $v_\alpha$ is a residual, and use the fact that

  $\dfrac{N(A^2 + B^2)/4}{S(v_\alpha^2)/(N - 3)}$

has the F-distribution with 2 and N - 3 degrees of freedom.

If we calculate the value of $x = A^2 + B^2$ for a number of different periods
p, and draw a graph of x against p, the result is called a periodogram. If the
periodogram has well-marked peaks at certain points, which cannot reasonably
be attributed to sampling fluctuations, these values of p may be regarded as
the periods of genuine harmonic terms.
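
A periodogram of this kind can be sketched in a few lines of Python. The example assumes equally spaced observations and trial periods N/k (k complete cycles in the series). The simulated series, the seed, and the names `intensity` and `harmonic_F` are hypothetical. The F ratio is the intensity term of (11.30) on 2 df, divided by the residual mean square on N - 3 df.

```python
import numpy as np

def intensity(y, k):
    """Intensity A^2 + B^2 of the harmonic with period N/k."""
    y = np.asarray(y, dtype=float)
    N = y.size
    t = 2.0 * np.pi * k * np.arange(N) / N
    A = 2.0 / N * np.sum(y * np.cos(t))
    B = 2.0 / N * np.sum(y * np.sin(t))
    return A * A + B * B

def harmonic_F(y, k):
    """F ratio on 2 and N-3 df for the harmonic with period N/k."""
    y = np.asarray(y, dtype=float)
    N = y.size
    x_int = intensity(y, k)
    # Residual sum of squares: S(y^2) - N*ybar^2 - (N/2)(A^2 + B^2)
    rss = np.sum((y - y.mean()) ** 2) - 0.5 * N * x_int
    return (0.5 * N * x_int / 2.0) / (rss / (N - 3))

# Hypothetical series: N = 25 points, a wave of period 5 plus random noise.
rng = np.random.default_rng(0)
N = 25
pts = np.arange(N)
y = 5.0 + 2.0 * np.cos(2.0 * np.pi * 5 * pts / N + 0.3) + rng.normal(0.0, 1.0, N)

n = (N - 1) // 2                        # trial periods N, N/2, ..., N/n
xs = [intensity(y, k) for k in range(1, n + 1)]
best_k = 1 + int(np.argmax(xs))
print("largest intensity at period", N / best_k)
print("F for that period:", harmonic_F(y, best_k))
```

For odd N the intensities over all n = (N - 1)/2 trial periods account for the whole of the sum of squares about the mean, which gives a convenient arithmetic check.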

If P is the probability of getting a periodic component of intensity at least
equal to x,

(11.31)  $P = e^{-Nx/4\sigma^2}$

since $Nx/2\sigma^2$ has the $\chi^2$ distribution with 2 df. If x is calculated for n selected
periods, and x is the maximum value so obtained, then the probability P that at least
one intensity will exceed x by pure chance is given by

(11.32)  $1 - P = (1 - e^{-Nx/4\sigma^2})^n$

assuming that the selected periods are all independent. This formula was
given by Walker. Since we do, in fact, pick out the largest intensities for
examination, Walker's formula should be used rather than (11.31) for judging
significance. The test requires a knowledge of $\sigma^2$, but Fisher has given a
method which avoids the necessity of knowing $\sigma^2$. If the number of observations
N = 2n + 1, and if $x_1, x_2, \cdots, x_n$ are the intensities corresponding to
trial periods of N, N/2, N/3, $\cdots$, N/n, Fisher has obtained the distribution
of g = x/2v, where x is the maximum value of $x_1$ to $x_n$ and $v = S(y - \bar{y})^2/N =
(x_1 + x_2 + \cdots + x_n)/2$. He finds that the probability of a value of g at
least as great as a given value is

(11.33)  $P = n(1 - g)^{n-1} - \dfrac{n(n-1)}{2}\,(1 - 2g)^{n-1} + \cdots + (-1)^{m-1}\,\dfrac{n!}{m!\,(n-m)!}\,(1 - mg)^{n-1}$

the series stopping as soon as 1 - mg ceases to be positive.

H. T. Davis has calculated tables of this P and also of the P in (11.32), useful in
the analysis of economic time series. A general discussion of the problem and
of the difficulties of assessing the reality of apparent cyclical or oscillatory
movements may be found in Kendall's Advanced Theory of Statistics, Vol. II,
Chapter 30.
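
In place of tables, the two probabilities can be evaluated directly. The Python sketch below is illustrative only: the function names are hypothetical, g is taken to be the largest intensity divided by the sum of all n intensities, and the series (11.33) is cut off as soon as 1 - mg is no longer positive.

```python
from math import comb, exp

def fisher_g_pvalue(g, n):
    """Probability of a value at least g for n trial periods, by (11.33)."""
    total = 0.0
    m = 1
    while m <= n and 1.0 - m * g > 0.0:
        total += (-1) ** (m - 1) * comb(n, m) * (1.0 - m * g) ** (n - 1)
        m += 1
    return total

def walker_pvalue(x_max, N, sigma2, n):
    """Walker's formula (11.32) for the largest of n independent intensities."""
    return 1.0 - (1.0 - exp(-N * x_max / (4.0 * sigma2))) ** n

# Hypothetical use: 12 trial periods, largest intensity 40% of the total.
print(fisher_g_pvalue(0.40, 12))
# Hypothetical use: N = 25, sigma^2 = 1 known, largest of 12 intensities 1.2.
print(walker_pvalue(1.2, 25, 1.0, 12))
```

With n = 2 the first function reduces to 2(1 - g), which is easily verified from first principles.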

11.8 Estimation of x for a Given y in Curvilinear Regression. Suppose our
estimated regression equation is $Y = b_0 + b_1 x + b_2 x^2$. Our estimate of x for
$y = y_0$ will be a real root of $y_0 = b_0 + b_1 x + b_2 x^2$, lying within the range of x
for which the regression is presumed to hold. There may, of course, be no
such root or there may be two roots. If a real root exists, we should like to
have confidence limits for it.

Let us replace x by a parameter $\lambda$, and let $T = b_0 + b_1\lambda + b_2\lambda^2$. Then the
expectation of T is given by

(11.34)  $E(T) = \beta_0 + \beta_1\lambda + \beta_2\lambda^2$

and the variance by

(11.35)  $\mathrm{Var}(T) = E[T - \beta_0 - \beta_1\lambda - \beta_2\lambda^2]^2$
         $= E[(b_0 - \beta_0) + (b_1 - \beta_1)\lambda + (b_2 - \beta_2)\lambda^2]^2$
         $= \sigma^2(a^{00} + a^{11}\lambda^2 + a^{22}\lambda^4 + 2a^{01}\lambda + 2a^{02}\lambda^2 + 2a^{12}\lambda^3)$

by (10.43). Hence $\{T - E(T)\}/\{\mathrm{Var}(T)\}^{1/2}$ is normally distributed about 0
with variance 1. If $V'(T)$ is equal to Var (T) with $\sigma^2$ replaced by its estimate
$s^2 = S(y - Y)^2/(N - 3)$, then

  $\{T - E(T)\}/\{V'(T)\}^{1/2}$

has Student's t-distribution with N - 3 df. If $t_\alpha$ is the value of t such that
$100(1 - \alpha)\%$ of the distribution lies between $\pm t_\alpha$, the $100(1 - \alpha)\%$ confidence
limits are given by

(11.36)  $[(b_0 - \beta_0) + (b_1 - \beta_1)\lambda + (b_2 - \beta_2)\lambda^2]^2$
         $= t_\alpha^2 s^2 (a^{00} + a^{11}\lambda^2 + a^{22}\lambda^4 + 2a^{01}\lambda + 2a^{02}\lambda^2 + 2a^{12}\lambda^3)$

If when $y = y_0$ the true value of x is $\lambda$,

  $y_0 = \beta_0 + \beta_1\lambda + \beta_2\lambda^2$

so that on substituting in (11.36) we have

(11.37)  $(N - 3)(b_0 + b_1\lambda + b_2\lambda^2 - y_0)^2$
         $= t_\alpha^2\, S(y - Y)^2\, (a^{00} + a^{11}\lambda^2 + a^{22}\lambda^4 + 2a^{01}\lambda + 2a^{02}\lambda^2 + 2a^{12}\lambda^3)$

This is a quartic in $\lambda$. One root goes with the upper confidence limit and
one with the lower. There are two extraneous roots.

11.9 Estimation of Maximum or Minimum in Curvilinear Regression.
If the true regression is given by

  $\eta = \beta_0 + \beta_1 x + \beta_2 x^2$

$d\eta/dx = 0$ when $\beta_1 + 2\beta_2 x = 0$, so that the true maximum (or minimum) is
given by

  $\xi = -\beta_1/2\beta_2$

The maximum likelihood estimate of $\xi$ is $\hat{\xi} = -b_1/2b_2$. If we let
$T = b_1 + 2b_2\lambda$ then $E(T) = \beta_1 + 2\beta_2\lambda = 0$ when $\lambda = \xi$. Also

  $\mathrm{Var}(T) = E[b_1 - \beta_1 + 2(b_2 - \beta_2)\lambda]^2$
  $= \sigma^2(a^{11} + 4a^{22}\lambda^2 + 4a^{12}\lambda)$

so that

(11.38)  $\dfrac{b_1 + 2b_2\lambda}{s\,(a^{11} + 4a^{22}\lambda^2 + 4a^{12}\lambda)^{1/2}}$

has Student's t-distribution with N - 3 df. Hence confidence limits for $\xi$
can be established.

The same procedure can be used to find the point of inflection of a cubic
curve. If the true regression equation is

  $\eta = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$

then

  $d^2\eta/dx^2 = 2\beta_2 + 6\beta_3 x$

and this is equal to 0 when $x = \xi$ if

  $\xi = -\beta_2/3\beta_3$

The estimate is $\hat{\xi} = -b_2/3b_3$, and the confidence limits are obtained as before.
Again, suppose Y is a quadratic function of two variables u and v, so that

(11.39)  $Y = b_1 + b_2 u + b_3 v + b_4 u^2 + b_5 uv + b_6 v^2$

This is another special case (see § 11.3) of the general multivariate regression
of Chapter X. Here $x_1 = 1$, $x_2 = u$, $x_3 = v$, $x_4 = u^2$, $x_5 = uv$, $x_6 = v^2$.

The point at which Y is a maximum or minimum is given by $\partial Y/\partial u = 0$,
$\partial Y/\partial v = 0$, and is estimated from the equations

(11.40)  $b_2 + 2b_4 u + b_5 v = 0$
         $b_3 + b_5 u + 2b_6 v = 0$

If $\lambda$ and $\mu$ are two parameters, and if

(11.41)  $T_1 = b_2 + 2b_4\lambda + b_5\mu$
         $T_2 = b_3 + b_5\lambda + 2b_6\mu$

then $T_1$ and $T_2$ have a joint bivariate normal distribution with a known
variance-covariance matrix $A_{ij}\sigma^2$ (i, j = 1, 2). We have

  $E(T_1) = \beta_2 + 2\beta_4\lambda + \beta_5\mu = 0$

and

  $\mathrm{Var}(T_1) = \sigma^2(a^{22} + 4a^{44}\lambda^2 + a^{55}\mu^2 + 4a^{24}\lambda + 4a^{45}\lambda\mu + 2a^{25}\mu)$
  $= \sigma^2 A_{11}$

Similarly, $E(T_2) = 0$,

  $\mathrm{Var}(T_2) = \sigma^2(a^{33} + a^{55}\lambda^2 + 4a^{66}\mu^2 + 2a^{35}\lambda + 4a^{56}\lambda\mu + 4a^{36}\mu)$
  $= \sigma^2 A_{22}$

and

  $\mathrm{Cov}(T_1 T_2) = \sigma^2[a^{23} + 2a^{45}\lambda^2 + 2a^{56}\mu^2 + (a^{25} + 2a^{34})\lambda$
  $\qquad + (a^{35} + 2a^{26})\mu + (a^{55} + 4a^{46})\lambda\mu]$
  $= \sigma^2 A_{12}$

If $S_T = \sum_{i,j} A^{ij} T_i T_j$, where $A^{ij}$ is the element of the matrix inverse to
$[A_{ij}]$, then $S_T/\sigma^2$ can be written as a sum of squares of two independent
standard normal variates $\xi_1/\sigma$ and $\xi_2/\sigma$. For if

  $\xi_1 = (A^{11})^{1/2}(T_1 + A^{12}T_2/A^{11})$

  $\xi_2 = [A^{22} - (A^{12})^2/A^{11}]^{1/2}\,T_2$

then

  $\xi_1^2 + \xi_2^2 = A^{11}T_1^2 + A^{22}T_2^2 + 2A^{12}T_1T_2 = S_T$

Also $E(\xi_1) = E(\xi_2) = 0$, and

  $\mathrm{Var}(\xi_1) = A^{11}[\mathrm{Var}(T_1) + (A^{12}/A^{11})^2\,\mathrm{Var}(T_2) + 2(A^{12}/A^{11})\,\mathrm{Cov}(T_1T_2)]$
  $= \sigma^2 A^{11}[A_{11} + (A^{12}/A^{11})^2 A_{22} + 2(A^{12}A_{12})/A^{11}]$
  $= \sigma^2(A^{11}A_{11} + A^{12}A_{12}) + \sigma^2(A^{12}/A^{11})(A^{12}A_{22} + A^{11}A_{12})$
  $= \sigma^2$

Similarly $\mathrm{Var}(\xi_2) = \sigma^2$ and $\mathrm{Cov}(\xi_1\xi_2) = 0$. Hence $S_T/\sigma^2$ has the $\chi^2$
distribution with 2 df. If $S_M = S(y - Y)^2$, then, as we have proved in Chapter
X, $S_M/\sigma^2$ has the $\chi^2$ distribution with N - 6 df (there being 6 variables $x_1$ to
$x_6$ in the regression). Consequently

  $F = \dfrac{S_T}{S_M}\cdot\dfrac{N - 6}{2}$

has the F-distribution with 2 and N - 6 df. For a fixed value of F, this
provides a relation between $\lambda$ and $\mu$, the two-dimensional analogue of a
confidence limit. The values of $\lambda$ and $\mu$ given by (11.41), with $T_1 = T_2 = 0$, are
the estimated values of u and v corresponding to the maximum or minimum
of Y.

11.10 The Geometrical Picture of Multiple Regression and Correlation.
The term multiple correlation refers to a theory of correlation involving three
or more variables. For ease in exposition we shall restrict the derivation of
formulas to the three-variable case although the method is perfectly general.
When the three-variable case is understood the formulas can be generalized
for k variables.

The framework of a two-way table was a rectangle in the xy-plane which was
divided into cells by lines parallel to the axes. The analogue in the case
of three variables, which we shall denote by x, y, and z, is a
rectangular parallelepiped divided into cells by slicing planes parallel to
the axes.

We shall denote the frequency in the cell whose mid-point has the coordinates
(x, y, z) by f(x, y, z). A pair of (x, y) values fixes a z column (Figure 32),
and the sum of the frequencies in such a column is the column total:

(11.42)  $\sum_z f(x, y, z) = f(x, y)$

where here and subsequently the symbol $\sum$ together with the variable
underneath denotes a summation in the direction of that variable. Now consider
all those columns which have the same y. Their total frequency, denoted by

(11.43)  $\sum_x f(x, y) = f(y)$

may appropriately be called a "slab total" (Figure 33).

Finally, if we add all the slab totals we get the total frequency N. Thus

(11.44)  $\sum_y f(y) = N$

By making use of (11.42) we may, if we wish, express (11.43) as the double sum

(11.45)  $\sum_x \sum_z f(x, y, z) = f(y)$

and hence express (11.44) as the triple sum

(11.46)  $\sum_x \sum_y \sum_z f(x, y, z) = N$

(a) The aggregate of the column totals f(x, y) forms a two-way frequency
table. If we imagine the numerical values of these frequencies written in the
cells of the xy-plane it is easy to see that they constitute a correlation table
(Figure 34). For this table, the simple correlation coefficient $r_{xy}$ is called the
total correlation (in contradistinction to a partial correlation coefficient to be
defined later) and the regression curves are called the total regressions of y
on x and x on y. Discussions analogous to (a) may be given for horizontal
columns parallel (b) to Ox and (c) to Oy.

Fig. 34

The mean of a z column at (x, y) is defined by

(11.47)  $\bar{z}(x, y) = \dfrac{\sum_z z\,f(x, y, z)}{f(x, y)}$

Similarly, the mean of an x column at (y, z) is

(11.48)  $\bar{x}(y, z) = \dfrac{\sum_x x\,f(x, y, z)}{f(y, z)}$

and the mean of a y column at (x, z) is

(11.49)  $\bar{y}(x, z) = \dfrac{\sum_y y\,f(x, y, z)}{f(x, z)}$

The regression plane of z on xy is that plane which fits the means of the z
columns best in a weighted least-squares sense. This should not be confused
with the true regression surface, z on xy, which is defined as the locus of the
mean points of the z columns. More accurately, it is the locus of these points
as the dimensions of the cells approach zero and as $N \to \infty$. The regression
plane, z on xy, is that plane which fits best the true regression surface, z on xy.
Corresponding statements hold for the regression planes of y on xz and of
x on yz.

So far, it was convenient to designate our variables by the conventional
letters used in representing three-dimensional space. We are now about to
obtain the equations of the regression planes and in order to extend our results
to k variables it will be desirable to change to a new set of symbols which will
lend themselves more readily to generalization. The switch will cause no
difficulty. We shall now use $x_1$ in place of x, $x_2$ in place of y, and $x_3$ in place of z.
The relations between the r's in the old notation and the new are $r_{xy} = r_{12}$,
$r_{yz} = r_{23}$, $r_{xz} = r_{13}$.

We shall now derive the equation of the regression plane of $x_1$ on $x_2$ and $x_3$.
In determining, under a least-squares criterion, the parameters in its equation
it will simplify the exposition if we assume that the variables are measured
from their respective means as origin. This may be assumed without loss of
generality. Let the desired equation be of the form

(11.50)  $x_1 = Ax_2 + Bx_3 + C$

Then we may determine the parameters in (11.50) so that the sum of the
squares of the residuals

(11.51)  $U = \sum (x_1 - Ax_2 - Bx_3 - C)^2 f$

is a minimum, f being short for $f(x_1, x_2, x_3)$, and $\sum$ for $\sum_{x_1}\sum_{x_2}\sum_{x_3}$. Equating to
zero the first partial derivatives of U with respect to A, B, and C, we obtain
the equations

  $\sum x_2(x_1 - Ax_2 - Bx_3 - C)f = 0$

  $\sum x_3(x_1 - Ax_2 - Bx_3 - C)f = 0$

  $C = 0$

The simplification of the last equation is a consequence of our choice of origin
since $\sum x_1 f = \sum x_2 f = \sum x_3 f = 0$ when the origin of $x_i$ is at the mean of its
N values. The first two equations may be written in the form

(11.52)  $A\sum x_2^2 f + B\sum x_2 x_3 f = \sum x_1 x_2 f$
         $A\sum x_2 x_3 f + B\sum x_3^2 f = \sum x_1 x_3 f$

Let $s_i^2$ be the variance of $x_i$ and let $r_{ij}$ be the correlation coefficient between
$x_i$ and $x_j$. Then by definition,

  $\sum x_i^2 f(x_1, x_2, x_3) = Ns_i^2$

  $\sum x_i x_j f(x_1, x_2, x_3) = Ns_i s_j r_{ij}$

So (11.52) becomes
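(11.53)  $NAs_2^2 + NBs_2s_3r_{23} = Ns_1s_2r_{12}$
         $NAs_2s_3r_{23} + NBs_3^2 = Ns_1s_3r_{13}$

Solving for A and B we have

  $A = \dfrac{s_1}{s_2}\,\dfrac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}} \qquad B = \dfrac{s_1}{s_3}\,\dfrac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$

It is convenient both for simplicity and for the purpose of generalizing to k
variables to define the determinant R by

  $R = \begin{vmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{vmatrix}$

and to let $R_{ij}$ be the cofactor of $r_{ij}$. Thus,

  $R_{12} = -\begin{vmatrix} r_{21} & r_{23} \\ r_{31} & r_{33} \end{vmatrix} \qquad R_{13} = \begin{vmatrix} r_{21} & r_{22} \\ r_{31} & r_{32} \end{vmatrix}$

Clearly, $r_{11} = r_{22} = r_{33} = 1$, and $r_{12} = r_{21}$, etc., so the expressions for A and B
may be written

  $A = -\dfrac{s_1R_{12}}{s_2R_{11}} \qquad B = -\dfrac{s_1R_{13}}{s_3R_{11}}$

Hence (11.50) becomes

(11.54)  $\dfrac{x_1}{s_1}R_{11} + \dfrac{x_2}{s_2}R_{12} + \dfrac{x_3}{s_3}R_{13} = 0$

This equation gives the estimate of $x_1$ for assigned values of $x_2$ and $x_3$, provided
that the true regression is not far from being linear. It is an important
equation because it shows how, on the average, changes in $x_2$ and $x_3$ affect $x_1$.
The student will observe that the R's involve only simple correlation
coefficients and that all the necessary computations for the terms in (11.54) were
explained in Part I.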
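
The cofactor form of the coefficients is easily verified on a computer. The following sketch uses simulated (hypothetical) data and an illustrative helper `cofactor`; it forms $A = -s_1R_{12}/s_2R_{11}$ and $B = -s_1R_{13}/s_3R_{11}$ from the correlation determinant and compares them with an ordinary least-squares fit of $x_1$ on $x_2$ and $x_3$.

```python
import numpy as np

def cofactor(M, i, j):
    """Signed cofactor of element (i, j) of the square matrix M (0-based)."""
    minor = np.delete(np.delete(M, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(minor)

# Hypothetical data: 200 observations on three correlated variables.
rng = np.random.default_rng(1)
x2 = rng.normal(size=200)
x3 = 0.5 * x2 + rng.normal(size=200)
x1 = 2.0 * x2 - 1.0 * x3 + rng.normal(size=200)
data = np.column_stack([x1, x2, x3])
data = data - data.mean(axis=0)          # measure each variable from its mean

s = data.std(axis=0)                     # standard deviations s1, s2, s3
Rm = np.corrcoef(data, rowvar=False)     # matrix of the r_ij

R11 = cofactor(Rm, 0, 0)
A = -s[0] * cofactor(Rm, 0, 1) / (s[1] * R11)
B = -s[0] * cofactor(Rm, 0, 2) / (s[2] * R11)

# Direct least-squares fit of x1 on x2 and x3 for comparison.
coef, *_ = np.linalg.lstsq(data[:, 1:], data[:, 0], rcond=None)
print(A, B)
print(coef)
```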

There are two analogous equations for the regression planes of $x_2$ on $x_1$ and
$x_3$, and $x_3$ on $x_1$ and $x_2$, which can be obtained readily from (11.54) by a
cyclical permutation of the subscripts on x and R. They are

(11.55)  $\dfrac{x_2}{s_2}R_{22} + \dfrac{x_3}{s_3}R_{23} + \dfrac{x_1}{s_1}R_{21} = 0$

when $x_2$ is the dependent variable, and

(11.56)  $\dfrac{x_3}{s_3}R_{33} + \dfrac{x_1}{s_1}R_{31} + \dfrac{x_2}{s_2}R_{32} = 0$

when $x_3$ is the dependent variable. Referred to an arbitrary origin (11.54)
would have been

(11.57)  $\dfrac{X_1 - \bar{X}_1}{s_1}R_{11} + \dfrac{X_2 - \bar{X}_2}{s_2}R_{12} + \dfrac{X_3 - \bar{X}_3}{s_3}R_{13} = 0$

where $X_i - \bar{X}_i = x_i$. Analogous adjustments of (11.55) and (11.56) are
obvious when the variables are referred to an arbitrary origin.

The three-dimensional case can now be generalized. By methods similar
to those employed above we can derive the linear regression equation for k
variables. Thus we have the hyperplane $x_1$ on $x_2, x_3, \cdots, x_k$,

(11.58)  $\dfrac{x_1}{s_1}R_{11} + \dfrac{x_2}{s_2}R_{12} + \cdots + \dfrac{x_k}{s_k}R_{1k} = 0$

where $R_{ij}$ is the cofactor of $r_{ij}$ in

(11.59)  $R = \begin{vmatrix} r_{11} & r_{12} & \cdots & r_{1k} \\ r_{21} & r_{22} & \cdots & r_{2k} \\ \cdots & \cdots & \cdots & \cdots \\ r_{k1} & r_{k2} & \cdots & r_{kk} \end{vmatrix}$

When expressed in standard units, (11.58) becomes

(11.60)  $t_1 = -\dfrac{1}{R_{11}} \sum_{i=2}^{k} R_{1i}t_i$

where $t_i = x_i/s_i$. Then $t_1$ may be regarded as a weighted mean of the
contributions of the other variables. The factor $R_{1i}$ represents the force or weight
of $t_i$ when all these variables are given an opportunity to predict the value of $t_1$.
It may be noted that (11.58) is simply a rewriting in different notation of
the regression equation (10.1), in which the $b_j$ are given by (10.20). For if
the x's are all measured from their means and if y is written as $x_1$ we have

  $a_{ij} = r_{ij}s_is_j, \quad g_j = r_{1j}s_1s_j \quad (i, j = 2, 3, \cdots, k)$

Then in the equation

(11.61)  $x_1 = b_2x_2 + b_3x_3 + \cdots + b_kx_k$

the $b_i$ are given by $b_i = \sum_j a^{ij}g_j$. By writing out the matrix $A = [a_{ij}]$
it is easily seen that its determinant is equal to $s_2^2s_3^2 \cdots s_k^2R_{11}$, so that
$a^{ij} = R_{11.ij}/s_is_jR_{11}$, where $R_{11.ij}$ is the cofactor of $r_{ij}$ in $R_{11}$. Also by the
rule for expanding determinants $\sum_j r_{1j}R_{11.ij}$ is equal to $-R_{i1}$, and hence
$b_i = -s_1R_{i1}/s_iR_{11}$.
Equation (11.61) can, therefore, be written

  $\dfrac{R_{11}x_1}{s_1} + \dfrac{R_{21}x_2}{s_2} + \cdots + \dfrac{R_{k1}x_k}{s_k} = 0$

which, because of the symmetry of the correlation matrix, is equivalent
to (11.58).

11.11 Variance about the Regression Plane. Let v be the distance,
measured parallel to the $x_1$-axis, between the regression plane and the point
$(x_1, x_2, x_3)$. That is, v = observed $x_1$ - estimated $x_1$, where the estimated $x_1$
is given by (11.54). Let

(11.62)  $s_{1.23}^2 = \dfrac{1}{N}\sum v^2 f(x_1, x_2, x_3)$

where $\sum$ denotes summation over all the points $(x_1, x_2, x_3)$. Then

  $Ns_{1.23}^2 = \sum f\left\{x_1 + \dfrac{s_1}{R_{11}}\left(\dfrac{R_{12}x_2}{s_2} + \dfrac{R_{13}x_3}{s_3}\right)\right\}^2$
  $= \dfrac{s_1^2}{R_{11}^2}\sum f\left(\dfrac{R_{11}x_1}{s_1} + \dfrac{R_{12}x_2}{s_2} + \dfrac{R_{13}x_3}{s_3}\right)^2$

Since $\sum fx_1^2 = Ns_1^2$, $\sum fx_1x_2 = Nr_{12}s_1s_2$, etc., we have

  $s_{1.23}^2 = \dfrac{s_1^2}{R_{11}^2}(R_{11}^2 + R_{12}^2 + R_{13}^2 + 2R_{11}R_{12}r_{12} + 2R_{11}R_{13}r_{13} + 2R_{12}R_{13}r_{23})$
  $= \dfrac{s_1^2}{R_{11}^2}[R_{11}(R_{11} + r_{12}R_{12} + r_{13}R_{13}) + R_{12}(R_{12} + r_{12}R_{11} + r_{23}R_{13})$
  $\qquad + R_{13}(R_{13} + r_{13}R_{11} + r_{23}R_{12})]$
  $= \dfrac{s_1^2}{R_{11}^2}(R_{11}R)$

by the familiar rules for determinants. Hence we have

(11.63)  $s_{1.23}^2 = s_1^2\,\dfrac{R}{R_{11}}$

The square root of this is usually called the standard error of estimate of $x_1$
for assigned values of $x_2$ and $x_3$. As in the corresponding case of two variables
(see §§ 8.6 and 8.7), $\dfrac{N}{N - 3}\,s_{1.23}^2$ is an unbiased estimate of the corresponding
population parameter $\sigma_1^2 P/P_{11}$, where P is the determinant $|\rho_{ij}|$, i, j = 1, 2, 3,
and $P_{11}$ is the cofactor of $\rho_{11}$ in P. Moreover, as in § 8.9, the sampling error
of the coefficients in the regression equation introduces additional terms into
the standard error of estimate of $x_1$, so that the s.e. of an observed $x_1$ is actually
given by the square root of

^ , 2 fi 4- ^ 4- 4- 

11.12 Variance Due to Regression. The variance of $x_1$ due to the
regression (11.54), which we write $s_{1(23)}^2$, is given by

  $s_{1(23)}^2 = \dfrac{1}{N}\sum X_1^2 f$

where $X_1$ is the estimated value from (11.54). Hence, as in § 11.11,

  $s_{1(23)}^2 = \dfrac{s_1^2}{R_{11}^2}(R_{12}^2 + R_{13}^2 + 2R_{12}R_{13}r_{23})$
  $= \dfrac{s_1^2}{R_{11}^2}[R_{12}(R_{12} + r_{23}R_{13}) + R_{13}(R_{13} + r_{23}R_{12})]$
  $= \dfrac{s_1^2}{R_{11}^2}[-R_{12}r_{12}R_{11} - R_{13}r_{13}R_{11}]$

Consequently

(11.64)  $s_{1(23)}^2 = s_1^2(1 - R/R_{11})$

It follows from (11.63) and (11.64) that

(11.65)  $s_1^2 = s_{1.23}^2 + s_{1(23)}^2$

11.13 The Multiple Correlation Coefficient. With two variables the
proportion of the total variance explained by regression is equal to $r^2$, where r is
the ordinary coefficient of correlation. So here, the quantity

(11.66)  $r_{1.23}^2 = \dfrac{s_{1(23)}^2}{s_1^2} = 1 - \dfrac{R}{R_{11}}$

where $s_{1(23)}^2$ denotes the variance of $x_1$ due to the regression,
is the square of the multiple correlation coefficient of $x_1$ on $x_2$ and $x_3$. This
coefficient may be regarded also as the ordinary correlation coefficient between
the observed $x_1$ and the estimated $x_1$. If we denote the estimated $x_1$ by $X_1$,
we have for this correlation

(11.67)  $r_{1.23} = \dfrac{\sum x_1X_1 f}{Ns_1s_{1(23)}} = -\dfrac{s_1}{NR_{11}s_1s_{1(23)}}\sum x_1\left(\dfrac{R_{12}x_2}{s_2} + \dfrac{R_{13}x_3}{s_3}\right)f$
         $= -\dfrac{1}{R_{11}s_{1(23)}}[R_{12}r_{12}s_1 + R_{13}r_{13}s_1]$
         $= \left(1 - \dfrac{R}{R_{11}}\right)^{1/2}$

which agrees with (11.66). By a cyclical permutation of the subscripts we
can write at once the formulas for the multiple correlation coefficients of $x_2$ on
$x_1$ and $x_3$, and of $x_3$ on $x_1$ and $x_2$. They are

(11.68)  $r_{2.13} = \left(1 - \dfrac{R}{R_{22}}\right)^{1/2}$

(11.69)  $r_{3.12} = \left(1 - \dfrac{R}{R_{33}}\right)^{1/2}$

By writing (11.63) in the form

  $s_{1.23}^2 = s_1^2\left[1 - \left(1 - \dfrac{R}{R_{11}}\right)\right]$

we obtain the formula

(11.70)  $s_{1.23}^2 = s_1^2(1 - r_{1.23}^2)$

which is quite analogous to the corresponding expression in simple correlation. It is
clear from (11.70) that

(11.71)  $-1 \le r_{1.23} \le 1$

Each of the formulas (11.67) to (11.69) may be generalized for k variables.
Thus the multiple correlation coefficient of order k - 1 of $x_1$ with the other
k - 1 variables is

(11.72)  $r_{1.23\cdots k} = \left(1 - \dfrac{R}{R_{11}}\right)^{1/2}$

where now $R_{11}$ is the cofactor of $r_{11}$ in R as defined in (11.59).

Example 4. Three variables have in pairs simple correlation coefficients given by

  $r_{12} = 0.8 \qquad r_{13} = -0.7 \qquad r_{23} = -0.9$

Find the multiple correlation coefficient $r_{1.23}$ of $x_1$ on $x_2$ and $x_3$.

Solution.

  $R = \begin{vmatrix} 1 & 0.8 & -0.7 \\ 0.8 & 1 & -0.9 \\ -0.7 & -0.9 & 1 \end{vmatrix} = 0.068$

  $R_{11} = 0.19 \qquad r_{1.23} = 0.80$

Example 5. Suppose it is found that $r_{12} = 0.6$, $r_{13} = -0.4$, $r_{23} = 0.7$. Comment on these
results.

Solution. R = -0.346, $R_{11}$ = 0.51, $r_{1.23}$ = 1.29. This is an impossible value of $r_{1.23}$. It
is clear from equations (11.67) to (11.69) that $R_{11}$, $R_{22}$, and $R_{33}$ must all have the same sign
as R if the multiple correlation coefficients are to be numerically less than 1. In this example
suspicion might be aroused by noting that while $r_{12}$ and $r_{23}$ are both fairly large and positive,
$r_{13}$ is fairly large and negative, contrary to what one would expect.
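
The arithmetic of Examples 4 and 5 can be reproduced with the short Python sketch below. The function name `multiple_r_squared` is illustrative only; it returns R, $R_{11}$ and $1 - R/R_{11}$ for a given set of three simple correlations.

```python
import numpy as np

def multiple_r_squared(r12, r13, r23):
    """Return (R, R11, r_{1.23}^2) from the three simple correlations."""
    Rm = np.array([[1.0, r12, r13],
                   [r12, 1.0, r23],
                   [r13, r23, 1.0]])
    R = np.linalg.det(Rm)
    R11 = 1.0 - r23 ** 2               # cofactor of r11
    return R, R11, 1.0 - R / R11

# Example 4: a consistent set of correlations.
R4, R11_4, rsq4 = multiple_r_squared(0.8, -0.7, -0.9)
print(R4, R11_4, np.sqrt(rsq4))

# Example 5: R is negative, so the squared coefficient exceeds 1.
R5, R11_5, rsq5 = multiple_r_squared(0.6, -0.4, 0.7)
print(R5, R11_5, rsq5)
```

A negative value of R, as in Example 5, is a signal that the three coefficients cannot all have come from one and the same set of observations.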

11.14 Some Limiting Cases of Multiple Correlation.

Theorem 11.1. The necessary and sufficient condition for coincidence of the
three regression planes (11.54), (11.55), and (11.56) is

(11.73)  $r_{12}^2 + r_{13}^2 + r_{23}^2 - 2r_{12}r_{13}r_{23} = 1$

For these planes are coincident if and only if the coefficients are proportional.
This will be so if $R_{11}/R_{21} = R_{12}/R_{22} = R_{13}/R_{23}$ and $R_{21}/R_{31} = R_{22}/R_{32} = R_{23}/R_{33}$.
On writing out the cofactors in terms of $r_{ij}$ it will be found that these
relations are equivalent to (11.73).

Theorem 11.2. If $r_{1.23} = \pm 1$, then (11.73) is satisfied and the regression is
linear.

For by (11.66) the condition for perfect multiple correlation is R = 0, and
this is equivalent to (11.73).

Example 6. Given the following data, $r_{12} = 0.6$, $r_{13} = 0.4$. Find the value of $r_{23}$ in order
that $r_{1.23} = 1$.

Solution. Substituting the given values in (11.73) we have

  $r^2 - 0.48r - 0.48 = 0$,

where the subscripts are dropped for the moment. Solving, we find $r = 0.24 \pm 0.73$.
So $r_{23} = 0.97$ (the other root, $r_{23} = -0.49$, also satisfies (11.73)).
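
The quadratic of Example 6 may be solved numerically as a check. The sketch below is illustrative only (the variable names are not from the text); it finds both roots and confirms that each makes the determinant R vanish, which by (11.66) is the condition for perfect multiple correlation.

```python
import numpy as np

r12, r13 = 0.6, 0.4

# (11.73) as a quadratic in r23: r^2 - 2*r12*r13*r + (r12^2 + r13^2 - 1) = 0
roots = np.sort(np.roots([1.0, -2.0 * r12 * r13, r12 ** 2 + r13 ** 2 - 1.0]).real)
print(roots)

def det_R(r23):
    """Determinant R of the 3 x 3 correlation matrix."""
    return 1.0 + 2.0 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2

print([det_R(r) for r in roots])
```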

The example shows that even though $r_{12}$ and $r_{13}$ are individually small, it
does not follow that there cannot be high correlation between $x_1$, $x_2$ and $x_3$.
Indeed two variables which individually with a third variable have
correlations which are apparently worthless for predicting purposes may be very
valuable when the three variables are taken together and multiple regression
employed. On the other hand, it may be possible to get as good a prediction
from $r_{12}$ or $r_{13}$ using simple regression as from multiple regression. This
situation will be clarified by the following theorems.

Theorem 11.3. If $r_{23} = 1$, then $r_{1.23}^2 = r_{12}^2 = r_{13}^2$ and $s_{1.23}^2 = s_1^2(1 - r_{12}^2)$.

Proof: When $r_{23} = 1$, $R = 2r_{12}r_{13} - r_{12}^2 - r_{13}^2 = -(r_{12} - r_{13})^2$. But also
in this case $R_{11} = 0$, so that R = 0 from (11.67). Hence $r_{12} = r_{13}$. Now
when $r_{12} = r_{13}$ the general expression for $1 - \dfrac{R}{R_{11}}$ reduces to $\dfrac{2r_{12}^2}{1 + r_{23}}$, so that if
we now put $r_{23} = 1$, (11.66) gives $r_{1.23}^2 = r_{12}^2 = r_{13}^2$ and (11.63) gives
$s_{1.23}^2 = s_1^2(1 - r_{12}^2)$.

In this case, then, multiple regression has no advantage over the simple
regression $x_1$ on $x_2$ or $x_1$ on $x_3$, because the standard error is exactly what it
would be if the third variable were not added. Since $r_{23} = 1$, there is perfect
linear dependence between $x_2$ and $x_3$. Geometrically, all the data lie in the
regression plane.

Theorem 11.4. If $r_{23} = 0$, then $r_{1.23}^2 = r_{12}^2 + r_{13}^2$.

For $R_{11} = 1$, and $R = 1 - r_{12}^2 - r_{13}^2$, so that $1 - \dfrac{R}{R_{11}} = r_{12}^2 + r_{13}^2$.
Therefore

(11.74)  $s_{1.23}^2 = s_1^2(1 - r_{12}^2 - r_{13}^2)$

Hence, when $x_2$ and $x_3$ are completely independent, multiple regression gives
a better prediction than would be given by either of the simple regressions
$x_1$ on $x_2$ or $x_1$ on $x_3$; very much better if also $r_{12}$ and $r_{13}$ are nearly equal. If they
are exactly equal their maximum value is $1/\sqrt{2} = 0.707$. This theorem shows
that one has a good regression equation for predicting when each of two
variables is highly correlated with the third variable but not with the other.

11.15 The Distribution of the Multiple Correlation Coefficient for Samples
from an Uncorrelated Parent Population. In the notation of Chapter X,
the regression of y on the variables $x_1, x_2, \cdots, x_p$ is given by

  $Y = \sum_i b_ix_i$

and the multiple correlation coefficient of y with $x_1, x_2, \cdots, x_p$ is

(11.75)  $r_{yY} = S(yY)/\{S(y^2)S(Y^2)\}^{1/2}$

This is the ordinary correlation coefficient of the observed y values with the
estimated Y values, measured from their means.

The normal equations for the regression coefficients are

  $\sum_j a_{ij}b_j = g_i, \quad i = 1, 2, \cdots, p$

where

  $a_{ij} = S\,x_{i\alpha}x_{j\alpha}, \qquad g_i = S\,x_{i\alpha}y_\alpha$

For convenience we may suppose the variables standardized, so that the
means are all 0 and the variances 1. If $x_0$ is written for y, we then have

  $a_{ij} = Nr_{ij}, \qquad g_i = Nr_{0i}$

and the normal equations become

(11.76)  $\sum_j r_{ij}b_j = r_{0i}$

If the residuals v are given by v = y - Y,

  $\dfrac{1}{N}S(vx_i) = \dfrac{1}{N}S(y - Y)x_i = \dfrac{1}{N}S(x_0 - \sum_j b_jx_j)x_i$
  $= r_{0i} - \sum_j b_jr_{ij}$
  $= 0$, by (11.76)

Hence $S(vY) = \sum_i b_iS(vx_i) = 0$, and therefore, since $S(vY) = S(yY) - S(Y^2)$,

(11.77)  $S(yY) = S(Y^2)$

Now

  $S(v^2) = S(y^2 - 2yY + Y^2)$
  $= S(y^2) - 2S(yY) + S(Y^2)$
  $= S(y^2) - S(Y^2)$

by (11.77), so that from (11.75) we have

(11.78)  $r_{yY}^2 = \dfrac{S(Y^2)}{S(y^2)} = \dfrac{S(Y^2)}{S(v^2) + S(Y^2)}$

This may be written

(11.79)  $r_{yY}^2/(1 - r_{yY}^2) = S(Y^2)/S(v^2)$

If the $x_1, x_2, \cdots, x_p$ all have a fixed set of values and if y is a random normal
variate independent of $x_1, \cdots, x_p$ (that is, the multiple correlation coefficient
in the parent population is zero) then $S(Y^2)$ and $S(v^2)$ are independently
distributed as $\chi^2\sigma^2$ with p and N - p - 1 degrees of freedom. Hence the quantity

(11.80)  $F = \dfrac{S(Y^2)}{S(v^2)}\cdot\dfrac{N - p - 1}{p}$

is distributed as Snedecor's F with p and N - p - 1 df. This means that
$r_{yY}^2$ is a Beta-variate with parameters $\tfrac{1}{2}(N - p - 1)$ and $\tfrac{1}{2}p$. The
distribution of $r_{yY}^2$ is, therefore, identical with that of the square of the correlation
ratio (see § 11.1), with p + 1 instead of p. We are now dealing with p + 1
variables, y and $x_1, x_2, \cdots, x_p$, and so have p degrees of freedom for the
regression of y on the x's.
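
The identities (11.77) to (11.80) can be checked on simulated data. The Python sketch below is a minimal illustration with hypothetical values of N and p and an arbitrary random seed; y is generated independently of the x's, as the section assumes.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)                 # independent of the x's (rho = 0)
X = X - X.mean(axis=0)                 # measure everything from the means
y = y - y.mean()

b, *_ = np.linalg.lstsq(X, y, rcond=None)
Y = X @ b                              # estimated values
v = y - Y                              # residuals

S_yY, S_YY, S_vv, S_yy = y @ Y, Y @ Y, v @ v, y @ y
r2 = S_yY ** 2 / (S_yy * S_YY)                     # square of (11.75)
F_from_sums = S_YY / S_vv * (N - p - 1) / p        # (11.80)
F_from_r2 = r2 / (1.0 - r2) * (N - p - 1) / p      # (11.80) via (11.79)

print(S_yY, S_YY)                      # equal, by (11.77)
print(F_from_sums, F_from_r2)          # equal
```

The resulting F would be referred to tables with p and N - p - 1 degrees of freedom.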

11.16 The Distribution of the Multiple Correlation Coefficient for a
Correlated Parent Population. A geometrical picture of multiple correlation may
be helpful to some readers. For the jth variate $x_j$ we have N observations
such as $x_{j\alpha}$ ($\alpha = 1, 2, \cdots, N$). In a flat (Euclidean) space of N dimensions
the whole set of observations may be represented by a single point, of
coordinates $x_{j\alpha}$, or by a single vector joining this point to the origin.

If the $x_{j\alpha}$ are measured from their mean the square of the length of this
vector is equal to $Ns_j^2$, where $s_j$ is the standard deviation of $x_j$. Also if $Ox_i$ and
$Ox_j$ are the vectors corresponding to the variates $x_i$ and $x_j$, the cosine of the
angle between them is equal to $S(x_{i\alpha}x_{j\alpha})/[S(x_{i\alpha}^2)S(x_{j\alpha}^2)]^{1/2} = r_{ij}$, the correlation
coefficient for $x_i$ and $x_j$.

If we consider the case of p = 2, the two vectors $Ox_1$ and $Ox_2$ will
determine a plane (Fig. 36) and the vector OY, where $Y = b_1x_1 + b_2x_2$, will lie
in this plane. The vector Oy will in general lie outside the plane, but since
OY is determined so as to make $S(y - Y)^2$ a minimum, Y is the foot of the
perpendicular from y on to this plane. If $\theta$ is the angle between Oy and OY,

  $\cos\theta = S(yY)/\{S(y^2) \cdot S(Y^2)\}^{1/2}$

and so is equal to the multiple correlation coefficient of y with $x_1$ and $x_2$. If
the number of predictors (p) is greater than 2, the same general picture holds,
but there is no unique perpendicular to the space determined by $x_1, x_2, \cdots, x_p$.
There is instead a (p - 1)-dimensional sub-space, but we will not pursue this
matter further.
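
The geometrical statement for p = 2 can be illustrated numerically. In the sketch below the data are simulated (hypothetical), each variate is treated as a vector in N dimensions after being measured from its mean, and OY is obtained as the least-squares projection of Oy on the plane of $Ox_1$ and $Ox_2$.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 12
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = 1.0 * x1 - 0.5 * x2 + rng.normal(size=N)
x1, x2, y = (a - a.mean() for a in (x1, x2, y))    # measure from the means

X = np.column_stack([x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
Y = X @ b                                          # vector OY in the plane

cos_theta = (y @ Y) / np.sqrt((y @ y) * (Y @ Y))   # cosine of angle yOY
r_yY = np.corrcoef(y, Y)[0, 1]                     # multiple correlation
perp = X.T @ (y - Y)                               # zero if yY is perpendicular

print(cos_theta, r_yY)
print(perp)
```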

It was proved by Fisher that the distribution of $r_{yY}$ for a given sample
size and for a given number of predictors is a function of the multiple
correlation coefficient $\rho$ in the parent population, and of this alone. To show this,
we first apply to the $x_j$ a non-singular linear transformation so as to get new
variables $x_j'$ which are uncorrelated. We may, for example, let $x_1' = x_1$,
$x_2' = x_{2.1}$ (the deviations of $x_2$ from the regression of $x_2$ on $x_1$), $x_3' = x_{3.21}$
(the deviations from the regression of $x_3$ on $x_2$ and $x_1$), and so on. This
transformation will leave Y and therefore $r_{yY}$ invariant. Geometrically it consists
in choosing new vectors $Ox_j'$ which are mutually perpendicular, but the angle $\theta$
of Figure 36 is unchanged.

We can now apply an orthogonal transformation so that the correlation
between y and one of the new variables $x_1'$ is a maximum in the parent
population. Then y is uncorrelated with all the other new variables and the
multiple correlation coefficient becomes the ordinary correlation coefficient $\rho$
between y and $x_1'$. Since all the other correlations in the parent population
are now zero, and since these transformations leave $r_{yY}$ invariant, the
frequency function of $r_{yY}$ must be a function of $\rho$.
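The function is a complicated one, namely,

(11.81)  $f(r^2) = \dfrac{(1 - \rho^2)^{\frac{N-1}{2}}\,(1 - r^2)^{\frac{N-p-3}{2}}\,(r^2)^{\frac{p-2}{2}}}{B\!\left(\tfrac{1}{2}p,\ \tfrac{1}{2}(N - p - 1)\right)}\;F\!\left(\tfrac{1}{2}(N - 1),\ \tfrac{1}{2}(N - 1),\ \tfrac{1}{2}p,\ \rho^2r^2\right)$

where r is written for $r_{yY}$, F is the hypergeometric function and B is the Beta
function. For $\rho = 0$, $r^2$ is a Beta-variate.

It may be proved that the expected value of $r_{yY}^2$ is given by

(11.82)  $E(r^2) = 1 - \dfrac{N - p - 1}{N - 1}\,(1 - \rho^2)\,F\!\left(1,\ 1,\ \tfrac{1}{2}(N + 1),\ \rho^2\right)$
         $= 1 - \dfrac{N - p - 1}{N - 1}\,(1 - \rho^2)\left(1 + \dfrac{2\rho^2}{N + 1} + \cdots\right)$

which, when $\rho = 0$, reduces to

(11.83)  $E(r^2) = \dfrac{p}{N - 1}$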
11.17 Partial Correlation. Assume, as before, that the variables $x_1$, $x_2$, $x_3$
are referred to their own means as origin. Suppose that we wish to know what
the correlation between $x_1$ and $x_2$ would be if the influence of $x_3$ were eliminated.
Let us subtract from the $x_1$ of each point that part of $x_1$ which is due to the
influence of $x_3$, as indicated by the regression of $x_1$ on $x_3$, and denote the
residual by $x_{1.3}$. Thus we have

(11.84)  $x_{1.3} = x_1 - r_{13}\dfrac{s_1}{s_3}x_3$
         $x_{2.3} = x_2 - r_{23}\dfrac{s_2}{s_3}x_3$

We now define the partial correlation coefficient ($r_{12.3}$) of $x_1$ and $x_2$ in the
trivariate distribution of $x_1$, $x_2$ and $x_3$ as the ordinary correlation coefficient
of $x_{1.3}$ and $x_{2.3}$. By definition,

(11.85)  $r_{12.3} = \dfrac{\sum x_{1.3}x_{2.3}f(x_1, x_2, x_3)}{Ns_{1.3}s_{2.3}}$

The numerator may be written, using (11.84), as

  $\sum x_1x_2f - r_{23}\dfrac{s_2}{s_3}\sum x_1x_3f - r_{13}\dfrac{s_1}{s_3}\sum x_2x_3f + r_{13}r_{23}\dfrac{s_1s_2}{s_3^2}\sum x_3^2f$
  $= Ns_1s_2(r_{12} - r_{13}r_{23} - r_{13}r_{23} + r_{13}r_{23})$
  $= Ns_1s_2(r_{12} - r_{13}r_{23})$

In the denominator of (11.85), $s_{1.3}^2$ is the residual variance of $x_1$ after
eliminating the regression on $x_3$, and hence

(11.86)  $s_{1.3}^2 = s_1^2(1 - r_{13}^2)$

Similarly

(11.87)  $s_{2.3}^2 = s_2^2(1 - r_{23}^2)$

Substituting in (11.85) we obtain the result

(11.88)  $r_{12.3} = \dfrac{r_{12} - r_{13}r_{23}}{[(1 - r_{13}^2)(1 - r_{23}^2)]^{1/2}}$

If $b_{12.3}$ and $b_{21.3}$ are the partial regression coefficients of $x_{1.3}$ on $x_{2.3}$ and of
$x_{2.3}$ on $x_{1.3}$ respectively,

(11.89)  $b_{12.3} = \dfrac{\sum x_{1.3}x_{2.3}f}{\sum x_{2.3}^2f} = r_{12.3}\dfrac{s_{1.3}}{s_{2.3}}$
         $b_{21.3} = \dfrac{\sum x_{1.3}x_{2.3}f}{\sum x_{1.3}^2f} = r_{12.3}\dfrac{s_{2.3}}{s_{1.3}}$

so that

(11.90)  $r_{12.3}^2 = b_{12.3}b_{21.3}$

From equations (11.86) to (11.89) we readily obtain

  $b_{12.3} = -\dfrac{s_1}{s_2}\,\dfrac{R_{12}}{R_{11}}$

and hence $b_{12.3}$ is identical with the $b_2$ of equation (11.61). In the same way,
$b_{13.2} = b_3$, so that in the present notation the regression plane of $x_1$ on $x_2$ and
$x_3$ may be written

  $x_1 = b_{12.3}x_2 + b_{13.2}x_3$

The partial correlation coefficient may also be regarded as the correlation 
between Xi and x^ when xz is held constant. That is, we limit attention to a 
sub-set of the whole set of observations, a slab parallel to the X1X2 plane, and 
calculate the ordinary correlation coefficient for this sub-set. A classical 
example is the correlation between statures of fathers and sons, when the 
stature of the mother has a particular value, say 62 inches. 

The partial correlation coefficient defined iu this way depends in general, 
however, on the value of xz selected. Necessary and sufficient conditions that 
this coefficient is independent of Xz, and is equal to as defined by ( 11 . 85 ), 
are that: 

(a) The bivariate regression of Xi on xz (ignoring X2) is linear, and the stand- 
ard deviations of all the Xi arrays (xz constant) are equal. 

(h) The trivariate regression of xi on Xz and Xz is linear and the standard 
deviation of all the xi arrays (xz and Xz constant) are equal. 

If these conditions are satisfied, and if $r_{12.3}$ is the correlation coefficient
for $x_1$ and $x_2$ with $x_3$ constant,

$s_{1.23}^2 = s_{1.3}^2 (1 - r_{12.3}^2)$

and

$s_{1.3}^2 = s_1^2 (1 - r_{13}^2) = s_1^2 R_{22}$

By (11.66) and (11.70), $s_{1.23}^2 = s_1^2 R/R_{11}$, so that $1 - r_{12.3}^2 = R/(R_{11} R_{22})$, whence
$r_{12.3}^2 = R_{12}^2/(R_{11} R_{22})$, in agreement with (11.88).

Tables of $(1 - r^2)^{1/2}$ and of $1 - r^2$ have been prepared by J. R. Miner to
facilitate the computation of $r_{12.3}$. By letting $\sin\theta = r$, $\cos\theta = (1 - r^2)^{1/2}$,
one can use ordinary trigonometric tables for the same purpose.

The conditions (a) and (b) given above are not likely to be satisfied very
accurately in practical applications. The calculated value of $r_{12.3}$ will be a
sort of average value of the correlations which could be obtained for all
assignments of $x_3$.


Example 7. In a study of the factors which influence "academic success," May obtained
the following results (among others) based on the records of 450 students at Syracuse
University.

    X1 = honor points        X2 = general intelligence      X3 = hours of study
    mean of X1 = 18.5        mean of X2 = 100.6             mean of X3 = 24
    s1 = 11.2                s2 = 15.8                      s3 = 6
    r12 = 0.60               r13 = 0.32                     r23 = -0.35

One purpose of the study was to find to what extent honor points were related to general
intelligence, when hours of study (per week) are held constant. Using (11.88) it is found
that $r_{12.3} = 0.80$.
The other partial correlations are $r_{13.2} = 0.71$, $r_{23.1} = -0.72$, so that for all three pairs
of variates the correlations are stronger when the effect of the third variable is eliminated.
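
The arithmetic of Example 7 can be checked with a few lines of code. The following is only a sketch: the function name `partial_corr` is our own, and the three simple correlations are those quoted in the example (with $r_{23}$ negative).

```python
from math import sqrt

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation r_ab.c computed from simple correlations, as in (11.88)."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Simple correlations quoted in Example 7.
r12, r13, r23 = 0.60, 0.32, -0.35

r12_3 = partial_corr(r12, r13, r23)  # honor points and intelligence, study hours eliminated
r13_2 = partial_corr(r13, r12, r23)  # honor points and study hours, intelligence eliminated
r23_1 = partial_corr(r23, r12, r13)  # intelligence and study hours, honor points eliminated

print(round(r12_3, 2), round(r13_2, 2), round(r23_1, 2))
```

Each call passes first the correlation of the pair of interest and then the two correlations of that pair with the eliminated variate, so the same function serves for all three coefficients.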

11.18 Partial Correlations with k Variables. The formulas of § 11.17 may
be extended to more than three variates. If there are $k$ variates altogether,
the partial correlation coefficient of $x_1$ and $x_2$, with all the rest eliminated, is
denoted by $r_{12.34 \ldots k}$ and defined as the ordinary correlation coefficient between
the residuals $v_1$ and $v_2$ for the multiple regressions of $x_1$ and $x_2$ on the other
variates. This definition gives

(11.91)  $r_{12.34 \ldots k} = -R_{12}/(R_{11} R_{22})^{1/2}$

where $R_{ij}$ is the cofactor of $r_{ij}$ in the $k$-rowed determinant (11.59).

Partial correlations may be defined of all orders from 1 to $k - 2$. Thus
when $k = 4$, there will be two first-order partial correlations of $x_1$ and $x_2$,
namely, $r_{12.3}$ and $r_{12.4}$, and a second-order partial correlation $r_{12.34}$. A partial
correlation of any order may be expressed in terms of partial correlations of
order one lower, by a relation similar to (11.88). For example,

(11.92)  $r_{12.34} = \dfrac{r_{12.4} - r_{13.4} r_{23.4}}{[(1 - r_{13.4}^2)(1 - r_{23.4}^2)]^{1/2}}$
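
The agreement between the cofactor formula (11.91) and the step-by-step reduction (11.92) may be illustrated numerically. The sketch below assumes a purely hypothetical correlation matrix for $k = 4$ (the numbers are ours, not data from the text), and the helper names `det`, `cofactor` and `first_order` are likewise our own.

```python
from math import sqrt

def det(m):
    """Determinant by cofactor expansion along the first row (adequate for small k)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def cofactor(m, i, j):
    """Signed minor of element (i, j), indices counted from zero."""
    minor = [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]
    return (-1) ** (i + j) * det(minor)

def partial_from_cofactors(m, i, j):
    """(11.91): partial correlation of variates i and j with all others eliminated."""
    return -cofactor(m, i, j) / sqrt(cofactor(m, i, i) * cofactor(m, j, j))

def first_order(r_ab, r_ac, r_bc):
    """(11.88) and (11.92): eliminate one further variate c."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical correlation matrix for four variates (illustrative only).
R = [[1.0, 0.6, 0.5, 0.4],
     [0.6, 1.0, 0.3, 0.2],
     [0.5, 0.3, 1.0, 0.1],
     [0.4, 0.2, 0.1, 1.0]]

# Eliminate x4 first, then x3, as in (11.92).
r12_4 = first_order(R[0][1], R[0][3], R[1][3])
r13_4 = first_order(R[0][2], R[0][3], R[2][3])
r23_4 = first_order(R[1][2], R[1][3], R[2][3])
r12_34_recursive = first_order(r12_4, r13_4, r23_4)

# Same quantity directly from the cofactors of the full determinant.
r12_34_cofactor = partial_from_cofactors(R, 0, 1)

print(r12_34_recursive, r12_34_cofactor)
```

The two printed values should coincide apart from rounding error, whatever valid correlation matrix is supplied.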



11.19 The Distribution of the Partial Correlation Coefficient. As already
described, the set of values $x_{1\alpha}$ ($\alpha = 1, 2, \ldots, N$) may be regarded as fixing a
point $X_1$ or a vector $OX_1$ in $N$-dimensional space. If these values are all measured from
their mean, $S x_{1\alpha} = 0$, so that one degree of freedom is lost, and the point $X_1$ is in a
space of $N - 1$ dimensions. ($X_1$ is the projection of the original $X_1$ on the
hyperplane $S x_\alpha = 0$.)

Let $OX_1$, $OX_2$, $OX_3$ (Figure 37) represent the sets of values of $x_1$, $x_2$, $x_3$ in this
$(N - 1)$-dimensional space. If $X_1A$ and $X_2B$ are drawn perpendicular to $OX_3$, $X_1A$
and $X_2B$ represent the vectors $x_{1.3}$ and $x_{2.3}$ respectively, which are such
that $S x_{1.3}^2$ and $S x_{2.3}^2$ are minimized. If then $\theta$ is the dihedral angle between
the planes $X_1OX_3$ and $X_2OX_3$, $\cos\theta$ is the coefficient of partial correlation
between $x_1$ and $x_2$, with $x_3$ eliminated. From the diagram we see that $\theta$
is the angle between the projections of $OX_1$ and $OX_2$ on the $(N - 2)$-dimensional
space perpendicular to $OX_3$. It follows that the sampling distribution
of $r_{12.3}$ is the same as that of $r_{12}$ (see (8.68)), but with $N - 2$ instead of $N - 1$
and with $\rho_{12.3}$ instead of $\rho$. With $k$ variables altogether the distribution of
$r_{12.3 \ldots k}$ has $N - k + 1$ instead of $N - 1$.

Fig. 37

Further discussion of the trigonometrical aspects of correlation will be found
in a paper by Dunham Jackson.
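
The geometrical statement of this section, that $r_{12.3}$ is the cosine of the angle between the parts of $OX_1$ and $OX_2$ perpendicular to $OX_3$, can be verified on any small set of numbers. The observations below are hypothetical and the helper names are our own; the sketch is meant only to illustrate the construction of Figure 37.

```python
from math import sqrt

def center(v):
    """Measure the values from their mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two vectors (the correlation, for centered data)."""
    return dot(u, v) / sqrt(dot(u, u) * dot(v, v))

def residual(u, w):
    """Component of u perpendicular to w, i.e. the least-squares residual of u on w."""
    b = dot(u, w) / dot(w, w)
    return [a - b * c for a, c in zip(u, w)]

# Hypothetical observations, N = 8 (illustrative only).
x1 = center([2.0, 4.0, 5.0, 4.0, 7.0, 8.0, 9.0, 11.0])
x2 = center([1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0])
x3 = center([5.0, 3.0, 6.0, 2.0, 7.0, 4.0, 8.0, 6.0])

r12, r13, r23 = cosine(x1, x2), cosine(x1, x3), cosine(x2, x3)

# Cosine of the angle between the vectors X1A and X2B of the text.
cos_theta = cosine(residual(x1, x3), residual(x2, x3))

# The same quantity from the simple correlations, by (11.88).
r12_3 = (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(cos_theta, r12_3)
```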

11.20 Correlograms. In a time series where a variable $x$ is measured as a
function of the time $t$, it will often happen that the observations are
correlated. The graph of $r_{x_1 x_2}$ as a function of the time $(t_2 - t_1)$
between $x_1$ and $x_2$ is called a correlogram. A theoretical correlogram is shown in Figure 38.
This model arises if we suppose that only consecutive observations really influence each other.
That is, the partial correlation between $x_1$ and $x_3$, eliminating the influence of $x_2$, is zero.

Since

$\rho_{13.2} = \dfrac{\rho_{13} - \rho_{12} \rho_{23}}{[(1 - \rho_{12}^2)(1 - \rho_{23}^2)]^{1/2}} = 0$

this implies that $\rho_{13} = \rho_{12} \rho_{23}$. If the correlation is constant between successive
consecutive pairs of members of the time series, $\rho_{23} = \rho_{12}$, so that $\rho_{13} = \rho_{12}^2$.

In the same way, $\rho_{14.23} = 0$, implying $\rho_{14.3} = \rho_{12.3} \rho_{24.3}$, whence we readily
obtain $\rho_{14} = \rho_{12}^3$. In general $\rho_{1n} = \rho_{12}^{n-1}$. If the observation $x_\alpha$ corresponds
to $t = \alpha$, and if $\rho_{12} = \rho_0$, the correlation between two observations separated
by time $t$ is given by

$\rho = \rho_0^t$

This is the curve shown in Figure 38.

Given a set of values such as the $x_\alpha$, correlated in this way, we can form a
new set of uncorrelated quantities by means of a linear transformation. Thus,
if all the $x_\alpha$ have the same variance $\sigma^2$ and if $x_\alpha$ and $x_\beta$ have covariance
$\sigma^2 \rho^{|\alpha - \beta|}$, let us put

$y_1 = x_1 (1 - \rho^2)^{1/2}$

$y_2 = x_2 - \rho x_1$

$\cdots$

$y_N = x_N - \rho x_{N-1}$

Then

$\mathrm{Var}(y_1) = \sigma^2 (1 - \rho^2)$

$\mathrm{Var}(y_2) = \sigma^2 (1 - 2\rho^2 + \rho^2) = \sigma^2 (1 - \rho^2)$

$\cdots$

$\mathrm{Var}(y_N) = \sigma^2 (1 - \rho^2)$

Also

$\mathrm{Cov}(y_1, y_j) = (1 - \rho^2)^{1/2} \sigma^2 (\rho^{j-1} - \rho \cdot \rho^{j-2}) = 0$

and

$\mathrm{Cov}(y_i, y_j) = \sigma^2 [\rho^{j-i} - \rho \cdot \rho^{j-i-1} - \rho \cdot \rho^{j-i+1} + \rho^2 \cdot \rho^{j-i}] = 0, \qquad 2 \le i < j \le N$


Hence the quantities $y_1, y_2, \ldots, y_N$ are uncorrelated and have a common
variance $\sigma^2 (1 - \rho^2)$. The ordinary least squares theory can be applied to
these new variables. Thus if we wish to construct a regression equation for a
variable $x_0$ in terms of $x_1, x_2, \ldots, x_p$, and if successive observations on all these
variables are correlated in the manner described above, the regression equation
will be of the form

$x_{0\alpha} - \rho x_{0,\alpha-1} = \sum_j b_j (x_{j\alpha} - \rho x_{j,\alpha-1})$

and will be used to predict values of $x_{0\alpha} - \rho x_{0,\alpha-1}$, instead of $x_{0\alpha}$.
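
The effect of the transformation from the $x_\alpha$ to the $y_\alpha$ can be confirmed by direct matrix multiplication. In the sketch below the values of $N$, $\sigma^2$ and $\rho$ are arbitrary illustrative choices, and the function names are our own; the covariance matrix of the $y$'s should come out as $\sigma^2 (1 - \rho^2)$ times the unit matrix.

```python
from math import sqrt

def correlogram(rho0, max_lag):
    """Theoretical correlogram rho = rho0**t for lags t = 0, 1, ..., max_lag."""
    return [rho0 ** t for t in range(max_lag + 1)]

def covariance_matrix(n, sigma2, rho):
    """Cov(x_a, x_b) = sigma2 * rho**|a - b|, the model of this section."""
    return [[sigma2 * rho ** abs(a - b) for b in range(n)] for a in range(n)]

def transformation(n, rho):
    """Rows give y_1 = x_1 * sqrt(1 - rho^2) and y_a = x_a - rho * x_(a-1) for a >= 2."""
    rows = [[0.0] * n for _ in range(n)]
    rows[0][0] = sqrt(1 - rho ** 2)
    for a in range(1, n):
        rows[a][a] = 1.0
        rows[a][a - 1] = -rho
    return rows

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

n, sigma2, rho = 5, 2.0, 0.6  # illustrative values only
A = transformation(n, rho)

# Covariance matrix of the y's is A * Cov(x) * A'.
cov_y = matmul(matmul(A, covariance_matrix(n, sigma2, rho)), transpose(A))

for row in cov_y:
    print([round(v, 10) for v in row])
```

With these values the diagonal terms are $\sigma^2 (1 - \rho^2) = 1.28$ and the off-diagonal terms vanish apart from rounding error.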
11.21 Discriminant Functions. Suppose we have several criteria
$x_1, x_2, \ldots, x_p$, each of which may be used to distinguish between two populations,
I and II. Thus for $x_1$ the two population distributions may be somewhat
as shown in Figure 39. It is obvious that, since these overlap, we shall
sometimes make a mistake in allotting an individual to one of
the two populations on the basis of the $x_1$ value alone. The problem arises,
therefore, of finding what function of $x_1, x_2, \ldots, x_p$
will give the smallest probability of error in assigning individuals
to one or the other of these two populations. An example
is the use of aptitude tests, intelligence tests, and the like, to make an
appraisal of a student's chance of success in, say, a university engineering
course. On the basis of these tests a student may be classified as I (likely to
make a success of engineering) or II (unlikely to do so). The vocational
adviser wishes to know what function of the test-scores available will serve
to discriminate most accurately between these two classes.

If the two distributions in Figure 39 are adjusted in scale so that the area
of each is unity, we should naturally take $\alpha$ as the dividing point for classification.
An individual with an $x_1$ greater than $\alpha$ would be put in population I.
The probability of mis-classifying a II as a I would be $1 - \Phi(\theta)$, where

$\Phi(\theta) = (2\pi)^{-1/2} \int_{-\infty}^{\theta} e^{-t^2/2}\,dt, \qquad \theta = \beta/\sigma$

the distributions being assumed normal with common standard deviation $\sigma$. The
probability of mis-classifying a I as a II is identical. If $\beta = 0$, this probability
is $\tfrac{1}{2}$, and therefore there is no effective discrimination.

If we imagine a hypothetical perfect discriminant $y$ which can take only two
values $-1$ and $+1$, $-1$ for all individuals in II and $+1$ for all individuals
in I, and if

$x_1 = \alpha + \beta y + \epsilon$

where $\epsilon$ is normally distributed about zero, we have the situation depicted
in Figure 39. It is necessary that the $\beta$ of this regression equation shall be
significantly greater than zero, if $x_1$ is to serve as a useful discriminant.

In the more general case, suppose that there are $N_1$ and $N_2$ individuals in
the populations I and II respectively, and that there are $p$ variates $x_1, x_2, \ldots, x_p$
which may serve as criteria for discrimination. The hypothetical perfect
discriminant $y$ may be supposed to take the value $\dfrac{N_2}{N_1 + N_2}$ for all individuals
in I and $-\dfrac{N_1}{N_1 + N_2}$ for all in II. This ensures that $\bar{y} = 0$ for the combined
populations. If now we calculate the regression equation of $y$ on $x_1, x_2, \ldots, x_p$,
say $Y = \sum_{i=1}^{p} b_i x_i$, then $Y$ will be the best possible linear combination of the $x_i$
for discriminating between I and II. It will give the best estimate of the
perfect discriminant $y$. It is called the discriminant function.

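Let $x_{ir\alpha}$ be the $\alpha$th value of $x_i$ for the $r$th population, so that $r$ is either 1 or 2,
and $\alpha = 1, 2, \ldots, N_r$. Let $\bar{x}_{i1}$ be the mean of $x_i$ for population I and $\bar{x}_{i2}$ the
mean for population II. It is supposed as usual that the mean of $x_i$ for the
combined population is zero. Then the least squares criterion for the choice
of the $b_i$ is that

$\sum_r S_\alpha (y_{r\alpha} - \sum_i b_i x_{ir\alpha})^2 = \min$

Differentiating with respect to $b_i$ and putting the derivatives equal to zero,
we obtain the normal equations

$\sum_r S_\alpha x_{ir\alpha} (y_{r\alpha} - \sum_j b_j x_{jr\alpha}) = 0, \qquad i = 1, 2, \ldots, p$

or

(11.93)  $\sum_j a_{ij} b_j = g_i, \qquad i = 1, 2, \ldots, p$

where

(11.94)  $a_{ij} = \sum_r S_\alpha x_{ir\alpha} x_{jr\alpha}$

and

(11.95)  $g_i = \sum_r S_\alpha x_{ir\alpha} y_{r\alpha}$

Since when $r = 1$, $y_{r\alpha} = N_2/(N_1 + N_2)$ and when $r = 2$, $y_{r\alpha} = -N_1/(N_1 + N_2)$
for all $\alpha$, we have from (11.95)

$g_i = \dfrac{N_2}{N_1 + N_2} N_1 \bar{x}_{i1} - \dfrac{N_1}{N_1 + N_2} N_2 \bar{x}_{i2} = \dfrac{N_1 N_2}{N_1 + N_2} (\bar{x}_{i1} - \bar{x}_{i2})$

(11.96)  $g_i = \lambda^2 d_i$

where $\lambda^2 = N_1 N_2/(N_1 + N_2)$ and $d_i = \bar{x}_{i1} - \bar{x}_{i2}$, the distance between the
means of the two populations for the variate $x_i$. Again, from (11.94),

$a_{ij} = S_\alpha \{(x_{i1\alpha} - \bar{x}_{i1})(x_{j1\alpha} - \bar{x}_{j1}) + (x_{i2\alpha} - \bar{x}_{i2})(x_{j2\alpha} - \bar{x}_{j2})\} + N_1 \bar{x}_{i1} \bar{x}_{j1} + N_2 \bar{x}_{i2} \bar{x}_{j2}$

Let us denote the combined sum of products for the variables $x_i$ and $x_j$ for
the two populations taken separately by $S_{ij}$. Then

(11.97)  $a_{ij} = S_{ij} + N_1 \bar{x}_{i1} \bar{x}_{j1} + N_2 \bar{x}_{i2} \bar{x}_{j2}$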


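Now since the mean of $x_i$ for the whole population is zero,

$N_1 \bar{x}_{i1} + N_2 \bar{x}_{i2} = 0$

that is,

$N_1 \bar{x}_{i1} + N_2 (\bar{x}_{i1} - d_i) = 0$

or

(11.98)  $\bar{x}_{i1} = \dfrac{N_2 d_i}{N_1 + N_2} = \dfrac{\lambda^2 d_i}{N_1}$

Similarly $\bar{x}_{i2} = -\lambda^2 d_i/N_2$. Hence from (11.97)

$a_{ij} = S_{ij} + \lambda^4 \left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right) d_i d_j$

(11.99)  $a_{ij} = S_{ij} + \lambda^2 d_i d_j$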

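The normal equations (11.93), therefore, become, on substituting from (11.96)
and (11.99),

$\sum_j b_j (S_{ij} + \lambda^2 d_i d_j) = \lambda^2 d_i$

or

(11.100)  $\sum_j S_{ij} b_j = \lambda^2 d_i (1 - \sum_j b_j d_j)$

If the matrix $[S^{ij}]$ is the inverse of the matrix $[S_{ij}]$, it follows that the $b_i$
are proportional to $\sum_j S^{ij} d_j$. If $a^2$ is the constant of proportionality, we have
from (11.100) that

$a^2 \sum_j \sum_k S_{ij} S^{jk} d_k = \lambda^2 d_i (1 - a^2 \sum_{j,k} d_j d_k S^{jk})$

and since $\sum_j S_{ij} S^{jk}$ is 1 when $k = i$ and 0 otherwise, this becomes

$a^2 d_i = \lambda^2 d_i (1 - a^2 D^2)$

where

(11.101)  $D^2 = \sum_{j,k} d_j d_k S^{jk}$

so that

(11.102)  $a^2 = \lambda^2/(1 + \lambda^2 D^2)$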
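
The closed form $b_i = a^2 \sum_j S^{ij} d_j$ with $a^2$ given by (11.102) may be compared with a direct solution of the normal equations. The sketch below uses two variates and entirely hypothetical values of $N_1$, $N_2$, $S_{ij}$ and $d_i$; the helper `solve2` is our own and simply applies Cramer's rule to a two-rowed system.

```python
def solve2(m, v):
    """Solve a 2 x 2 linear system m * x = v by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(v[0] * m[1][1] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - v[0] * m[1][0]) / det]

# Hypothetical inputs (illustrative only).
n1, n2 = 10, 15
S = [[4.0, 1.0], [1.0, 3.0]]   # within-population sums of squares and products S_ij
d = [1.0, 2.0]                 # differences of the population means d_i
lam2 = n1 * n2 / (n1 + n2)     # lambda^2 = N1 N2 / (N1 + N2)

# Direct route: a_ij = S_ij + lambda^2 d_i d_j, g_i = lambda^2 d_i, then solve (11.93).
a = [[S[i][j] + lam2 * d[i] * d[j] for j in range(2)] for i in range(2)]
g = [lam2 * di for di in d]
b_direct = solve2(a, g)

# Closed form: b_i = a^2 * sum_j S^{ij} d_j, with a^2 from (11.102).
s_inv_d = solve2(S, d)                              # the vector sum_j S^{ij} d_j
D2 = sum(di * wi for di, wi in zip(d, s_inv_d))     # D^2 of (11.101)
a2 = lam2 / (1 + lam2 * D2)
b_closed = [a2 * w for w in s_inv_d]

print(b_direct, b_closed)
```

The two sets of coefficients should agree apart from rounding error, and the residual sum of squares $\lambda^2 (1 - \sum b_i d_i)$ should equal $a^2$.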

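The significance of the regression may be tested by the usual methods.
The total sum of squares for $y$ is

(11.103)  $S(y^2) = N_1 \left(\dfrac{N_2}{N_1 + N_2}\right)^2 + N_2 \left(\dfrac{N_1}{N_1 + N_2}\right)^2 = \lambda^2$

The sum of squares due to regression is

(11.104)  $S(Y^2) = \sum_{i,j} a_{ij} b_i b_j = \sum_i b_i g_i$

$= \lambda^2 \sum_i b_i d_i$, by (11.96)

$= a^2 \lambda^2 D^2$, by (11.101)

$= \lambda^2 - a^2$

Hence the residual sum of squares is

$\lambda^2 (1 - a^2 D^2) = a^2$

and the analysis of variance is

                SS                              df                     MS
Regression      $\lambda^2 \sum b_i d_i$        $p$                    $\lambda^2 \sum b_i d_i / p$
Residual        $\lambda^2 (1 - \sum b_i d_i)$  $N_1 + N_2 - p - 1$    $\lambda^2 (1 - \sum b_i d_i)/(N_1 + N_2 - p - 1)$
Total           $\lambda^2$                     $N_1 + N_2 - 1$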

By the property of the multiple correlation coefficient given in (11.78) it is
evident that the quantity $\sum b_i d_i$, which is the ratio of the sum of squares due
to regression to the total sum of squares, is simply the square of the multiple
correlation coefficient of $y$ with $x_1, x_2, \ldots, x_p$. That is,

$r_{yY}^2 = a^2 \sum_{i,j} d_i d_j S^{ij} = a^2 D^2$

so that the quantity $D$ of (11.101) is proportional to the multiple correlation
coefficient.

Although $y$ is not a random variable while the $x$'s are random variables (an
inversion of the usual state of affairs), the F-test remains valid for the
non-vanishing of the $b_i$.

Also, if $s^2$ is the residual mean square and if $a^{ii}$ is the $i$th diagonal term in the
matrix $[a^{ij}]$, the inverse of $[a_{ij}]$, then $b_i/(s^2 a^{ii})^{1/2}$ is distributed as Student's $t$,
with $N_1 + N_2 - p - 1$ degrees of freedom, on the hypothesis that $\beta_i$ (the true value of $b_i$) is zero.

We may also test a theoretical discriminant function with coefficients proposed
arbitrarily. If $Y_0$ represents such a function, the regression on $Y_0$ will have
only one degree of freedom, and the difference of the sum of squares for
regression on $Y$ and on $Y_0$ will have $p - 1$ degrees of freedom. The significance of
this difference may be estimated by the F-test, by comparison of the mean
square for the difference of regressions and the residual mean square after
allowing for regression on $Y$.

Example 8 (D. M. Seath). The amount of Dutch clover in a forage stand was estimated
by a mechanical counter ($x_1$) and by eye ($x_2$). The two treatments to be discriminated were
randomized in 15 blocks of two plots each, so that 14 df could be taken out for block
differences, giving an analysis of variance:

                        df    SS(x1)    SS(x2)    SP(x1x2)
Between populations      1     13.47      8.43      10.65
Between blocks          14     93.11     54.69      60.95
Within populations      14     20.44      6.41       4.89
Total                   29    127.02     69.53      76.49

The SS and SP (sum of products) between populations are the quantities symbolized as
$\lambda^2 d_i d_j$ ($i, j = 1, 2$), where $\lambda^2 = 15/2$. The quantities $d_i$ were 1.34 and 1.06 respectively.
The $S_{ij}$ are now the sums of squares and products within populations, so that the $a_{ij}$ of the
normal equations are the sums of the items in the first and third rows of the above table.
The normal equations are, therefore,

$33.91 b_1 + 15.54 b_2 = 10.05$
$15.54 b_1 + 14.84 b_2 = 7.95$

the solution of which is

$b_1 = 0.0976, \qquad b_2 = 0.4336$

so that the best discriminant function is

$Y = 0.0976 x_1 + 0.4336 x_2$

The inverse matrix is

$[a^{ij}] = \begin{bmatrix} 0.0568 & -0.0595 \\ -0.0595 & 0.1297 \end{bmatrix}$

The analysis of variance for the perfect discriminant is

                SS       df     MS
Regression      4.428     2     2.214
Residual        3.072    13     0.236
Total           7.5      15

Note that in the above example

$\sum b_i d_i = 0.0976(1.34) + 0.4335(1.06) = 0.590$

so that $\lambda^2 \sum b_i d_i = 4.428$. From the residual mean square 0.236 and the
inverse matrix $[a^{ij}]$, we obtain the standard errors of $b_1$ and $b_2$. These are
respectively $(0.0568 \times 0.236)^{1/2} = 0.1158$ and $(0.1297 \times 0.236)^{1/2} = 0.1751$.
It is, therefore, clear that $b_1$ does not differ significantly from zero. The matrix
$[S_{ij}]$ is

$\begin{bmatrix} 20.44 & 4.89 \\ 4.89 & 6.41 \end{bmatrix}$

so that

$[S^{ij}] = \begin{bmatrix} 0.05985 & -0.04565 \\ -0.04565 & 0.1908 \end{bmatrix}$

Hence $D^2 = \sum d_i d_j S^{ij} = 0.1928$. The coefficient of correlation between $y$
and $Y$ is given by $r^2 = 4.428/7.5 = 0.5904 = 3.072 D^2$.

Suppose we now ask whether the best discriminant is significantly better
than a proposed discriminant, say $z = x_1 + x_2$. The SS for $z$ is equal to
$33.91 + 14.84 + 2 \times 15.54 = 79.83$. The SP for $y$ and $z$ is

$S y x_1 + S y x_2 = \lambda^2 (d_1 + d_2) = 18.00$

The coefficient of correlation between $y$ and $z$ is given by

$r^2 = \dfrac{(18.0)^2}{7.5 \times 79.83} = 0.5411$

so that the SS due to regression on $z$ is $7.5 \times 0.5411 = 4.058$.
We have then the following table:

                                     SS       df     MS
Regression on z                      4.058     1
Additional for regression on Y       0.370     1     0.370
Residual                             3.072    13     0.236
Total                                7.5      15

The additional sum of squares for regression on $Y$ over and above that on $z$ is
not significant. Hence $z = x_1 + x_2$ would be as good a discriminant as the
one calculated, as far as the available data go.

Finally we may consider the analysis of variance for the observed discriminant $Y$.
The total SS $= S(Y^2) = \lambda^2 \sum b_i d_i = 4.428$, and that between populations
is $N_1 \bar{Y}_1^2 + N_2 \bar{Y}_2^2 = \lambda^2 (\sum b_i d_i)^2 = 2.614$, with 2 df since here $p = 2$.
The analysis of variance is

                        SS       df     MS
Between populations     2.614     2     1.307
Within populations      1.814    13     0.139
Total                   4.428    15

The standard error of the observed difference $\bar{Y}_1 - \bar{Y}_2$ is $(0.139)^{1/2} = 0.373$.
Since the actual difference is $\sum b_i d_i = 0.590$, the significance of this difference
can be estimated.
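
The computations of Example 8 can be retraced from the analysis-of-variance table alone. The following sketch uses only the figures printed above ($\lambda^2 = 7.5$, $d_1 = 1.34$, $d_2 = 1.06$, and the between and within rows of the table); the variable names are our own. Because the printed values are rounded, the results may differ from the text in the third or fourth decimal place.

```python
from math import sqrt

lam2 = 15 * 15 / (15 + 15)                  # lambda^2 = N1 N2 / (N1 + N2) = 7.5
d = [1.34, 1.06]                            # differences of means d_i
within = [[20.44, 4.89], [4.89, 6.41]]      # S_ij
between = [[13.47, 10.65], [10.65, 8.43]]   # lambda^2 d_i d_j

# Coefficients and right-hand sides of the normal equations (11.93).
a = [[within[i][j] + between[i][j] for j in range(2)] for i in range(2)]
g = [lam2 * di for di in d]

# Inverse of the two-rowed matrix [a_ij].
det_a = a[0][0] * a[1][1] - a[0][1] * a[1][0]
a_inv = [[a[1][1] / det_a, -a[0][1] / det_a],
         [-a[1][0] / det_a, a[0][0] / det_a]]

b = [a_inv[i][0] * g[0] + a_inv[i][1] * g[1] for i in range(2)]

ss_regression = lam2 * sum(bi * di for bi, di in zip(b, d))
ss_residual = lam2 - ss_regression
ms_residual = ss_residual / 13              # 13 df, as in the text
std_errors = [sqrt(a_inv[i][i] * ms_residual) for i in range(2)]

# Proposed discriminant z = x1 + x2.
ss_z = sum(a[i][j] for i in range(2) for j in range(2))
sp_yz = lam2 * (d[0] + d[1])
ss_regression_z = sp_yz ** 2 / ss_z
f_additional = (ss_regression - ss_regression_z) / ms_residual   # 1 and 13 df

print([round(x, 4) for x in b], round(ss_regression, 3), round(ss_residual, 3))
print([round(x, 4) for x in std_errors], round(ss_regression_z, 3), round(f_additional, 2))
```

The ratio `f_additional` compares the additional sum of squares for regression on $Y$ with the residual mean square, which is the test described in the text.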

11.22 An Alternative Approach to Discriminant Functions. We may ask
what linear function of the $x_i$, say

(11.105)  $L = \sum_i l_i x_i, \qquad i = 1, 2, \ldots, p$

will give the greatest possible value for the ratio of the sum of squares between
populations to the sum of squares and products within populations. The
sum of squares between populations for the variable $x_i$ is

$N_1 \bar{x}_{i1}^2 + N_2 \bar{x}_{i2}^2 = \lambda^2 d_i^2$

by (11.98). Hence the sum of squares between populations for $L$ is

(11.106)  $B = \lambda^2 (\sum_i l_i d_i)^2 = \lambda^2 \sum_{i,j} l_i l_j d_i d_j$

The SS for $x_i$ and the SP for $x_i$, $x_j$, within the populations, are the quantities
denoted above by $S_{ii}$ and $S_{ij}$. Hence the total sum of squares and products
within populations for $L$ is

(11.107)  $W = \sum_{i,j} S_{ij} l_i l_j, \qquad (i, j = 1, 2, \ldots, p)$

We require to choose the $l_i$ so as to make the ratio $B/W$ a maximum.
Since $\dfrac{\partial B}{\partial l_i} = 2\lambda B^{1/2} d_i$ and $\dfrac{\partial W}{\partial l_i} = 2 \sum_j S_{ij} l_j$, the condition

$\dfrac{\partial}{\partial l_i} \left(\dfrac{B}{W}\right) = 0$

gives

$\dfrac{2\lambda B^{1/2} d_i}{W} - \dfrac{2B}{W^2} \sum_j S_{ij} l_j = 0$

or

(11.108)  $\sum_j S_{ij} l_j = \gamma d_i, \qquad i = 1, 2, \ldots, p$

where $\gamma = \lambda W B^{-1/2}$. Therefore

(11.109)  $l_i = \gamma \sum_j d_j S^{ij}$

so that, apart from a constant of proportionality, $l_i$ is equal to $b_i$ as given by
the regression approach described in the previous section. The $L$ of (11.105)
is, therefore, in effect the same discriminant function as $Y$, since, of course,
it is only the ratio of the coefficients that really matters in choosing a
discriminant function.

Since from (11.107), (11.108), and (11.109)

$W = \gamma \sum_i l_i d_i = \gamma^2 D^2, \qquad B = \lambda^2 \gamma^2 D^4$

we have

$B/W = \lambda^2 D^2$

It was shown by Hotelling that $\dfrac{N_1 + N_2 - p - 1}{p} \cdot \dfrac{B}{W}$ is distributed as
Snedecor's $F$ with $p$ and $N_1 + N_2 - p - 1$ degrees of freedom, on the assumption
that the populations are actually identical.

The discriminant function is closely related to a measure of generalized
distance between two populations, proposed by Mahalanobis, and also to a
generalized T-test, suggested by Hotelling for distinguishing between the
means of different multivariate normal populations.

Given two samples of sizes $N_1$ and $N_2$, with $p$ variates measured for each
sample, any one observation on the $i$th variate for the $r$th sample will be
assumed to be given by

(11.110)  $x_{ir\alpha} = \mu_{ir} + \epsilon_{ir\alpha}$

where $i = 1, 2, \ldots, p$, $r = 1, 2$, $\alpha = 1, 2, \ldots, N_r$, and where the $\epsilon_{ir\alpha}$ have a
multivariate normal distribution with mean 0 and covariance matrix $[\sigma_{ij}]$.
The difference of the means for the two populations for the variate $x_i$ is

$\delta_i = \mu_{i1} - \mu_{i2}$

and the generalized distance is given by

(11.111)  $\Delta^2 = \dfrac{1}{p} \sum_{i,j} \sigma^{ij} \delta_i \delta_j$

where $[\sigma^{ij}]$ is the inverse of $[\sigma_{ij}]$.

Bose and Roy have studied the sampling distribution of the studentized
statistic

(11.112)  $pD^2 = \sum_{i,j} s^{ij} d_i d_j$

where $s_{ij}$ is an estimate of $\sigma_{ij}$ from the pooled sum of squares within samples.
In the notation of § 11.21, $s_{ij} = S_{ij}/(N_1 + N_2 - 2)$.

There is a bias in this value of $D^2$, since $E(D^2) = \Delta^2 + 1/\lambda^2$, but this is small
if $N_1$ and $N_2$ are large. However, if we define $D_0^2$ as $D^2 - 1/\lambda^2$, we have

(11.113)  $E(D_0^2) = \Delta^2$

The distribution of $D_0^2$ is complicated. It is in effect what is known as
non-central $\chi^2$. If $x = p\lambda^2 D\Delta$ and $t = 1/(p\lambda^2 \Delta^2)$, so that $xt = D/\Delta$,
the frequency function $f(x)$ of (11.114) is expressed in terms of
a Bessel function of imaginary argument, the properties of which
may be studied in Whittaker and Watson's Modern Analysis or in Watson's
Treatise on Bessel Functions.

The function D (or Do) has been used in anthropology for classifying and 
distinguishing between certain human populations, on the basis of a con- 
siderable number of measurable traits. 

Example 9. In a certain experiment (actual details somewhat simplified), each of a number
of rabbits received both a high dose and a low dose of insulin (in random order) and the
blood sugar was measured at 1, 2, and 3 hours after each dose. Denoting these measured
values by $x_1$, $x_2$, and $x_3$, the SS and SP are given by the following table:

                                             SS                          SP
                                    df    x1^2    x2^2    x3^2    x1x2    x1x3    x2x3
Between populations (λ² d_i d_j)     1     519    3503    5645    1349    1712    4447
Within populations (S_ij)           34    2677    2358    3223    1278    1814    1966

The inverted matrix $[S^{ij}]$ is proportional to

$\begin{bmatrix} 3.735 & -0.553 & -1.765 \\ -0.553 & 5.337 & -2.945 \\ -1.765 & -2.945 & 4.679 \end{bmatrix}$

If $d_i$ is the mean low-insulin value minus the mean high-insulin value for $x_i$, the data give
values of $d_1$, $d_2$, $d_3$ proportional to 353, 917, 1164. Hence, from (11.109), the values of $l_1$, $l_2$,
$l_3$ are proportional to $-1.243$, $1.272$, $2.123$ respectively. (Constant multipliers are
disregarded throughout all these calculations.) A close approximation to the best discriminant
would, therefore, be

$L = -3x_1 + 3x_2 + 5x_3$

On evaluating $B$ and $W$ for this discriminant, by (11.106) and (11.107), we obtain
$B = 235{,}091$, $W = 107{,}446$. Since $p = 3$ and $N_1 + N_2 = 36$, the value of $F$ is 23.3, with
df 3 and 32.

If, instead, we try as a discriminant the mean of the three observations $x_1$, $x_2$, and $x_3$, or
equivalently

$L = x_1 + x_2 + x_3$

we find

$B = 24{,}683, \qquad W = 18{,}374, \qquad F = 14.3$

It appears that the first discriminant function is distinctly better than the second. By the
regression method of the previous section it may be shown that the difference is significant
at the 5% level.
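
The values of $B$, $W$ and $F$ quoted in Example 9 follow directly from the table of sums of squares and products. In the sketch below the two matrices are taken from that table; the function names are our own, and $F$ is computed as $(N_1 + N_2 - p - 1)B/(pW)$, which is the form used in the example.

```python
# Sums of squares and products from the table of Example 9.
between = [[519, 1349, 1712],
           [1349, 3503, 4447],
           [1712, 4447, 5645]]     # lambda^2 d_i d_j
within = [[2677, 1278, 1814],
          [1278, 2358, 1966],
          [1814, 1966, 3223]]      # S_ij

def quadratic_form(m, l):
    """Sum over i, j of l_i * m_ij * l_j, as in (11.106) and (11.107)."""
    return sum(l[i] * m[i][j] * l[j] for i in range(3) for j in range(3))

def evaluate(l, p=3, n_total=36):
    """Return B, W and F = (N1 + N2 - p - 1) B / (p W) for the coefficients l."""
    b_value = quadratic_form(between, l)
    w_value = quadratic_form(within, l)
    return b_value, w_value, (n_total - p - 1) / p * b_value / w_value

B1, W1, F1 = evaluate([-3, 3, 5])   # close approximation to the best discriminant
B2, W2, F2 = evaluate([1, 1, 1])    # sum (or mean) of the three observations

print(B1, W1, round(F1, 1))
print(B2, W2, round(F2, 1))
```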

Problems 

1. Prove the statement in § 11.1 that if a straight line $Y = a + bx$ is fitted to the array
means in such a way that $S = \sum N_i (\bar{y}_i - a - b x_i)^2$ is a minimum, then this line is the ordinary
regression line of $y$ on $x$.

2. Prove that if $E^2$ has the distribution given by (11.6) with $\lambda = 0$ then $n_2 E^2/\{n_1 (1 - E^2)\}$
has the F-distribution with $n_1$ and $n_2$ df, where $n_1 = p - 1$, $n_2 = N - p$.

3. Prove that the sum of the squares of $N$ numbers differing by 1 and centered at 0 is
$N(N^2 - 1)/12$, whether $N$ is odd or even.

For $N$ odd, the numbers are $\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots$; for $N$ even they are
$\ldots, -2\tfrac{1}{2}, -1\tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, 1\tfrac{1}{2}, 2\tfrac{1}{2}, \ldots$.

Prove also that the sum of the fourth powers is $N(N^2 - 1)(3N^2 - 7)/240$. (See § 11.3.)

Hint. See (1.28) and (1.30). These results may be proved by induction.

4. Prove that the variance of the jth coefficient A, in the expression of F as a sum of 

& + 1 orthogonal polynomials' is and that the covariance of any two A„ A„ (t 9 ^ j) 

is zero. Show that the estimate of <7* is 

- AoS(y) AkS(y^k)]/(N - A - 1) 

5. Verify the computations of Example 3, § 11.5, by writing out the observation
equations and the normal equations and solving the latter.

6. Prove that the exact least squares solution of the problem of fitting F = a + he^ is 
given by equations (11.23). 

7. Write out the full discussions analogous to (a) of § 11.10 for columns parallel to $Ox$
and to $Oy$.

8. Show that the Gompertz curve F == and the logistic curve F = A/(l ce«») 
are similar in form to the modified exponential curve if for the Gompertz log F is expressed 
as a function of x and for the logistic 1/F is so expressed. Hence Cowden’s method of fitting 
may be applied to these curves by plotting log F (or 1/F) against x (see Reference 6). 

The Gompertz curve has been used in actuarial work, and the logistic in population 
studies. Both are curves with horizontal asymptotes. 

9. Find the multiple correlation coefficients and the regression equations for the data in 
Example 7, §11.17. 

10. (Garrett) The r for intelligence and school achievement in a group of children 8 to 14
years old is 0.80. The r for intelligence and age in the same group is 0.70. The r for school 
achievement and age is 0.60. What will be the correlation between intelligence and school 
achievement in children of the same age? 

11. (Yule and Kendall) The following means, standard deviations, and correlations are
found for

X1 = seed-hay crops in cwts. per acre
X2 = spring rainfall in inches
X3 = accumulated temperature above 42° F. in spring

in a certain district in England during 20 years.

    mean of X1 = 28.02     s1 = 4.42     r12 = 0.80
    mean of X2 = 4.91      s2 = 1.10     r13 = -0.40
    mean of X3 = 594       s3 = 85       r23 = -0.56

Find the partial correlations and the regression equation for hay crop on spring rainfall
and accumulated temperature.

12. The following data relate to land values and crops in twenty-five Iowa counties.

X1 = average value per acre of farm land on January 1, 1920
X2 = average yield of corn per acre in bushels 1910-1919
X3 = per cent of farm land in small grain
X4 = per cent of farm land in corn

County No.     X1     X2     X3     X4
     1        $87     40     11     14
     2        133     36     13     30
     3        174     34     19     30
     4        385     41     33     39
     5        363     39     25     33
     6        274     42     23     34
     7        236     40     22     37
     8        104     31      9     20
     9        141     36     13     27
    10        208     34     17     40
    11        115     30     18     19
    12        271     40     23     31
    13        163     37     14     25
    14        193     41     13     28
    15        203     38     24     31
    16        279     38     31     35
    17        179     24     16     26
    18        244     46     19     34
    19        165     34     20     30
    20        257     40     30     38
    21        252     41     22     36
    22        280     42     21     41
    23        167     35     16     23
    24        168     33     18     24
    25        116     36     18     21

(a) Find the linear regression equation of $X_1$ on $X_2$, $X_3$, $X_4$.

(b) Estimate the first five values of $X_1$, using the equation obtained in (a).

(c) Calculate $s_{1.234}$ and $r_{1.234}$.

13. (Pearl and Surface) In a biometrical study of egg production in the domestic fowl,
measurements of length, breadth and weight were made on 453 eggs. From all these, the
value of $r_{12.3}$ was $-0.8955$. If the 42 eggs weighing from 53 to 53.9 gm are considered
alone, the ordinary correlation coefficient $r_{12}$ between length and breadth is $-0.9117$.
Similarly the 46 eggs between 56 and 56.9 gm give $r_{12} = -0.8911$ and the 13 eggs between
62 and 62.9 gm give $r_{12} = -0.8739$.

Show that the weighted mean of these values of $r_{12}$ is very close to $r_{12.3}$, thus verifying for
this example the interpretation of $r_{12.3}$ given at the end of § 11.17.

CHAPTER XII


FURTHER CONSIDERATIONS ON STATISTICAL INFERENCE 


12.1 Introduction. The importance of inference in statistical theory and 
practice has been repeatedly emphasized in previous chapters. Apart from 
the purely descriptive side of statistics, in which the characteristics of a given 
finite population are represented graphically or summarized by a few measures 
such as the mean and standard deviation, almost all interesting statistical 
problems are concerned with estimation. Usually we have to estimate the
parameters of a parent population from a sample, the form of the distribution
being known or assumed, and to assign confidence limits for these estimates,
but there are also problems of non-parametric inference, when we wish to infer,
for instance, something about the form of a distribution and are not concerned
with the numerical value of the parameters.

In this chapter we give a sketch of certain methods of estimation and in 
particular of Fisher’s method of maximum likelihood and the Neyman-Pearson 
theory of statistical inference, both of which have been referred to occasionally 
in previous chapters. Closely connected with this theory are the sampling 
practices which have been developed in industry in recent years and which 
go under the general name of quality control. 

12.2 The Best Unbiased Estimate of a Parameter. Fisher's Inequality.
Let us suppose that we wish to estimate a single parameter $\theta$ of a parent
population, by means of a statistic $T$ which is calculated from a sample of $N$
independent observations $x_1, x_2, \ldots, x_N$ of a variate $x$. The statistic $T$ is an
unbiased estimate of $\theta$ if

(12.1)  $E(T) \equiv \theta$

whatever the values of any other parameters, $\theta', \theta'', \ldots$, which may occur in
the frequency function for the variate $x$.

It is the best unbiased estimate if its variance is at least as small as that of
any other unbiased estimate of $\theta$. That is,

(12.2)  $E[(T - \theta)^2] = \min$

Fisher's inequality states that

(12.3)  $\mathrm{Var}(T) \ge \dfrac{1}{N \int (\partial \log f/\partial\theta)^2 f\,dx}$

where $f$ stands for $f(x, \theta, \theta', \ldots)$, the frequency function for $x$. We now give
a proof of this statement.


Since the observations are independent, the probability density of the
observed sample is

$P = f(x_1, \theta, \theta', \ldots) f(x_2, \theta, \theta', \ldots) \cdots f(x_N, \theta, \theta', \ldots)$

Let us make a change of variables from $x_1, x_2, \ldots, x_N$ to $\xi_1, \xi_2, \ldots, \xi_{N-1}, T$,
where the $\xi$'s, like $T$, are suitable functions of the $x$'s, not depending on $\theta$.
Let the frequency function for $T$ be $g(T, \theta, \theta', \ldots)$ and let the conditional joint
frequency function for the $\xi$'s, given $T$, be $h(\xi_1, \xi_2, \ldots, \xi_{N-1} \mid T, \theta, \theta', \ldots)$.
Then by the ordinary formula for change of variable,

(12.4)  $P\,dx_1\,dx_2 \cdots dx_N = gh\,dT\,d\xi_1 \cdots d\xi_{N-1} = gh\,|J|\,dx_1 \cdots dx_N$

where $J$ is the Jacobian of the $\xi$'s and $T$ with respect to the $x$'s and is
independent of $\theta$. The functions $f$, $g$, $h$ all depend on $\theta$, and it is assumed that
these functions, as well as $Tg$, satisfy the conditions given in § 3.1 for
differentiating with respect to $\theta$ under the sign of integration. If so, $T$ is said to
be a regular estimate of $\theta$.

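Since $f$, $g$, $h$ are frequency functions,

$\int f\,dx = 1, \qquad \int g\,dT = 1, \qquad \int \cdots \int h\,d\xi_1 \cdots d\xi_{N-1} = 1$

Differentiating these equations with respect to $\theta$, we obtain

$\int \dfrac{\partial f}{\partial\theta}\,dx = 0, \qquad \int \dfrac{\partial g}{\partial\theta}\,dT = 0, \qquad \int \cdots \int \dfrac{\partial h}{\partial\theta}\,d\xi_1 \cdots d\xi_{N-1} = 0$

These may be written

(12.5)  $\int \dfrac{\partial}{\partial\theta}(\log f)\,f\,dx = 0, \qquad \int \dfrac{\partial}{\partial\theta}(\log g)\,g\,dT = 0, \qquad \int \cdots \int \dfrac{\partial}{\partial\theta}(\log h)\,h\,d\xi_1 \cdots d\xi_{N-1} = 0$

Also, from (12.4), $\log P = \log g + \log h + \log |J|$, whence on differentiating,

(12.6)  $\dfrac{\partial \log P}{\partial\theta} = \dfrac{\partial \log g}{\partial\theta} + \dfrac{\partial \log h}{\partial\theta}$

$|J|$ being independent of $\theta$. Squaring both sides of (12.6), multiplying each
side by the corresponding side of (12.4) and integrating, we have

(12.7)  $\int \cdots \int \left(\dfrac{\partial \log P}{\partial\theta}\right)^2 P\,dx_1 \cdots dx_N = \int \cdots \int \left(\dfrac{\partial \log g}{\partial\theta} + \dfrac{\partial \log h}{\partial\theta}\right)^2 gh\,dT\,d\xi_1 \cdots d\xi_{N-1}$

Since $\log P = \sum_i \log f(x_i, \theta)$, the left-hand side of (12.7) may be written

$\int \cdots \int \left[\sum_i \left(\dfrac{\partial \log f_i}{\partial\theta}\right)^2 + \sum_{i \ne j} \dfrac{\partial \log f_i}{\partial\theta} \dfrac{\partial \log f_j}{\partial\theta}\right] f_1 f_2 \cdots f_N\,dx_1 \cdots dx_N$

where $f_i$ stands for $f(x_i, \theta, \theta', \ldots)$. The second sum vanishes on integration
by virtue of the first equation of (12.5). The first sum has $N$ terms all
identical after integration, and may be written

$N \int \left(\dfrac{\partial \log f}{\partial\theta}\right)^2 f\,dx$

The right-hand side of (12.7), with the last equation of (12.5), which removes the
cross-product term, gives

$\int \left(\dfrac{\partial \log g}{\partial\theta}\right)^2 g\,dT + \int \cdots \int \left(\dfrac{\partial \log h}{\partial\theta}\right)^2 gh\,dT\,d\xi_1 \cdots d\xi_{N-1}$

The second term in this expression is essentially non-negative, so that we have

(12.8)  $N \int \left(\dfrac{\partial \log f}{\partial\theta}\right)^2 f\,dx \ge \int \left(\dfrac{\partial \log g}{\partial\theta}\right)^2 g\,dT$

the sign of equality occurring only when $h$ is independent of $\theta$, so that
$(\partial \log h)/\partial\theta = 0$. Equation (12.1) is equivalent to

$\int T g(T, \theta, \theta', \ldots)\,dT = \theta$

and on differentiating with respect to $\theta$, we obtain

$\int T \dfrac{\partial g}{\partial\theta}\,dT = 1$

Since

$\int \dfrac{\partial g}{\partial\theta}\,dT = 0$

it follows that

$\int (T - \theta) \dfrac{\partial g}{\partial\theta}\,dT = 1$

or

(12.9)  $\int (T - \theta) g^{1/2} \cdot g^{1/2} (\partial \log g/\partial\theta)\,dT = 1$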

Now, there is a useful lemma, known as Schwarz's inequality, which states
that if $\phi$ and $\psi$ are real functions of $x$ with integrable squares over a given
range,

(12.10)  $\left(\int \phi\psi\,dx\right)^2 \le \int \phi^2\,dx \int \psi^2\,dx$

This is readily proved by noting that the quadratic form in $u$ and $v$,

$\int (u\phi + v\psi)^2\,dx$

is non-negative for all real values of $u$ and $v$. But the condition that
$Au^2 + 2Buv + Cv^2$ is non-negative is that $B^2 \le AC$, and this condition
immediately gives the Schwarz inequality. Applying this to the two functions
in (12.9), we have

(12.11)  $1 \le \int (T - \theta)^2 g\,dT \int g (\partial \log g/\partial\theta)^2\,dT$

The first factor on the right-hand side is Var $(T)$, and hence, by (12.8)

$\mathrm{Var}(T) \ge \dfrac{1}{N \int (\partial \log f/\partial\theta)^2 f\,dx}$

which is Fisher's inequality.

The sign of equality in (12.10) can occur only if $u\phi + v\psi = 0$, which means
that $\psi$ is proportional to $\phi$. In this case it means that

(12.12)  $\partial \log g/\partial\theta = k(T - \theta)$

where $k$ is independent of $T$.


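Example 1. If $T$ is a statistic used to estimate the mean $\mu$ of a normal distribution, for
which

$f = (2\pi\sigma^2)^{-1/2} \exp\{-(x - \mu)^2/2\sigma^2\}$

we have

$\dfrac{\partial \log f}{\partial\mu} = \dfrac{x - \mu}{\sigma^2}$

so that

$N \int \left(\dfrac{\partial \log f}{\partial\mu}\right)^2 f\,dx = \dfrac{N}{\sigma^4} \int (x - \mu)^2 f\,dx = N/\sigma^2$

Hence the variance of $T \ge \sigma^2/N$. Since $\sigma^2/N$ is actually the variance of the sample mean
$\bar{x}$, $\bar{x}$ is the best possible unbiased estimate of $\mu$.

In this example

$g(\bar{x}) = (N/2\pi)^{1/2} \sigma^{-1} \exp\{-N(\bar{x} - \mu)^2/2\sigma^2\}$

so that $\partial \log g/\partial\mu = N(\bar{x} - \mu)/\sigma^2$. This is of the form (12.12) with $k = N/\sigma^2$.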
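
The integral in Example 1 can also be evaluated numerically, which gives a check on the lower bound (12.3) in this case. The sketch below uses the trapezoidal rule over a wide finite range; the values of $\mu$, $\sigma$ and $N$ are arbitrary illustrative choices, and the function names are our own.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Normal frequency function f of Example 1."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

def score_integral(mu, sigma, half_width=12.0, steps=20000):
    """Trapezoidal estimate of the integral of (d log f / d mu)^2 f dx."""
    lower = mu - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for k in range(steps + 1):
        x = lower + k * h
        score = (x - mu) / sigma ** 2          # d log f / d mu
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * score ** 2 * normal_pdf(x, mu, sigma)
    return total * h

mu, sigma, n = 10.0, 3.0, 25   # illustrative values only

integral = score_integral(mu, sigma)     # should be close to 1 / sigma^2
lower_bound = 1.0 / (n * integral)       # right-hand side of (12.3)
variance_of_mean = sigma ** 2 / n        # actual variance of the sample mean

print(integral, lower_bound, variance_of_mean)
```

The lower bound and the variance of the sample mean agree, as the example asserts.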


12.3 Consistent and EflSlcient Statistics- A statistic T is said to be consist- 
ent, as an estimate of a parameter 6, if when the size of the sample N is 
increased indefinitely T tends in the stochastic sense to the value 6, That is. 
for any given e > 0. 

Pr{lr-0l >€}->0 asiV^oo 


The arithmetic mean $\bar{x}$ of a sample from a normal parent population with parameters $\mu$ and $\sigma$ is, as we have seen in Chapter VI, normally distributed about $\mu$ with variance $\sigma^2/N$. Hence

$$\Pr\{|\bar{x}-\mu| > \epsilon\} = 2\{1 - \Phi(\epsilon\sqrt{N}/\sigma)\}$$

where $\Phi(x)$ is the distribution function of the normal law. For fixed $\epsilon$ and $\sigma$, $N$ can be taken so large that $\Phi(\epsilon\sqrt{N}/\sigma)$ is arbitrarily near to 1. This proves the consistency of $\bar{x}$.
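The shrinking of this tail probability can be checked numerically. The following is a minimal sketch, assuming a normal parent with known $\sigma$; the function names `normal_cdf` and `tail_probability` are illustrative and do not come from the text.

```python
import math

def normal_cdf(z):
    """Distribution function Phi(z) of the standard normal law."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tail_probability(eps, sigma, n):
    """Pr{|xbar - mu| > eps} = 2{1 - Phi(eps*sqrt(N)/sigma)} for a normal parent."""
    return 2.0 * (1.0 - normal_cdf(eps * math.sqrt(n) / sigma))

# Illustrative values (assumed): eps = 0.1, sigma = 1, increasing sample sizes.
probs = [tail_probability(0.1, 1.0, n) for n in (10, 100, 1000, 10000)]
for n, p in zip((10, 100, 1000, 10000), probs):
    print(n, p)
```

For fixed $\epsilon$ and $\sigma$ the printed probabilities fall steadily toward zero as $N$ grows, which is the statement of consistency.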

If $T$ is such that (1) $E(T) \to \theta$, (2) $\operatorname{Var}(T) \to 0$ as $N \to \infty$, then it follows from the Bienaymé-Tchebycheff inequality (§ 4.14) that $T$ is a consistent statistic for the estimation of $\theta$. This is a useful rule for determining consistency.

The less the sampling variance of a statistic the more reliable the statistic will be as an estimate of a parameter $\theta$. It is, therefore, natural to take the efficiency of a statistic as inversely proportional to its variance. The expression on the right-hand side of (12.3) is the minimum value $V_{\min}$ of the variance of any unbiased statistic used to estimate $\theta$. The ratio of this minimum value to the actual variance of $T$ is called the efficiency of $T$. A statistic with an efficiency of 100% is called a most-efficient statistic. When a most-efficient statistic exists it can be found by the method of maximum likelihood (§ 12.4).

Example 2. The mean and the median of a sample of $N$ from a normal population with parameters $\mu$ and $\sigma$ are both consistent statistics for estimating $\mu$. Their variances are $\sigma^2/N$ (for any $N$) and $\pi\sigma^2/2N$ (for large $N$). Hence, at least for large $N$, the median is less efficient than the mean; its efficiency is approximately $2/\pi$ or 63.7%. This means that an estimate of $\mu$ from a sample of 64 observations, using the mean, is just about as reliable as an estimate from a sample of 100, using the median.
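The arithmetic of this example can be written out as a short sketch. The helper names are hypothetical, and the large-sample variance $\pi\sigma^2/2N$ of the median is taken from the example as given.

```python
import math

def median_efficiency():
    """Large-sample efficiency of the median relative to the mean (normal parent)."""
    var_mean = 1.0               # sigma^2/N, expressed in units of sigma^2/N
    var_median = math.pi / 2.0   # pi*sigma^2/(2N), in the same units
    return var_mean / var_median

def equivalent_mean_sample(n_median):
    """Sample size at which the mean has the same variance as the median of n_median."""
    return n_median * median_efficiency()

print(median_efficiency())           # approximately 0.6366
print(equivalent_mean_sample(100))   # approximately 63.7, i.e. about 64 observations
```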

Example 3. Suppose that $\sigma^2$ is the parameter to be estimated in a normal parent population, $\mu$ being known.

$$\frac{\partial \log f}{\partial\sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}$$

so that

$$E\left[\left(\frac{\partial \log f}{\partial\sigma^2}\right)^2\right] = \frac{1}{2\sigma^4}$$

The variance of a most-efficient estimate of $\sigma^2$ is, therefore, $2\sigma^4/N$. Now the sample variance, multiplied by $N/(N-1)$, is an unbiased estimate of $\sigma^2$ and its variance is $2\sigma^4/(N-1)$. Its efficiency is, therefore, $(N-1)/N$ and it is only asymptotically most-efficient. However, the estimate $\sum(x_\alpha - \bar{x})^2/N$ has a smaller variance but is biased.

12.4 Sufficient Statistics. The Method of Maximum Likelihood. In the proof of Fisher's inequality, we saw that the sign of equality in (12.3) can occur only when two conditions are satisfied:

(a) $h$ is independent of $\theta$,

(b) $\partial \log g/\partial\theta = k(T-\theta)$.

If condition (a) is satisfied, $T$ is called a sufficient statistic, whether or not (b) is satisfied also. The probability density $P$ is then, from (12.4), expressible in the form

$$P = g\,h\,|J|$$

where $g$ is a function of $T$ and $\theta$, and $h\,|J|$ is independent of $\theta$. If we let $L = \log P$, and call $L$ the likelihood,

$$L = L_1(T, \theta) + L_2 \tag{12.13}$$

where $L_2$ is independent of $\theta$. Knowledge of $L_2$ does not contribute anything, therefore, to the estimation of $\theta$, and it is for this reason that $T$ is called sufficient. In other words, $T$ gives all the information that the sample can supply about the parameter $\theta$.

The condition of sufficiency does not determine $T$ uniquely, since any function of $T$ will satisfy the condition (12.13) equally well. We naturally choose a function which will give a consistent estimate of $\theta$, and if possible one that is unbiased.

A way of obtaining estimates, known as the method of maximum likelihood,
is due to R. A. Fisher and has already been used several times in earlier 
chapters. It is the most important general method of estimation known, at 
least theoretically. In practice, the equations to which it leads are often 
intractable, but it is usually possible in such cases to improve the estimate 
given by a less efficient statistic by means of an iterative process, one or at 
most two repetitions of which will give a result practically as good as the 
maximum likelihood estimate (see § 12.7). 

The method of maximum likelihood consists in taking as an estimate of $\theta$ that value for which $P$ is a maximum. Since $L$ is a monotone increasing function of $P$, $L$ is also a maximum for the same value of $\theta$, and it is generally more convenient to work with $L$. Hence if the equations

$$\frac{\partial L}{\partial\theta} = 0, \qquad \frac{\partial^2 L}{\partial\theta^2} < 0 \tag{12.14}$$

are satisfied for $\theta = T$, then $T$ is a maximum likelihood estimate of $\theta$. A solution which is a mere constant is disregarded.

If there are two or more parameters, say $\theta, \theta', \cdots$, it may happen that $T$ can be found to maximize $L$ whatever the values of the other parameters may be. In that case there is a unique maximum likelihood estimate of $\theta$. Even if there is no unique estimate, it is often possible to estimate two or more parameters simultaneously.

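Example 4. For the normal distribution of Example 3,

$$P = (2\pi\sigma^2)^{-N/2}\exp\{-\textstyle\sum(x_\alpha - \mu)^2/2\sigma^2\}$$

where the sum is from 1 to $N$ for $\alpha$. Hence

$$L = C - \tfrac{1}{2}N\log\sigma^2 - \sum(x_\alpha - \mu)^2/2\sigma^2$$

$$\frac{\partial L}{\partial\sigma^2} = -N/2\sigma^2 + \sum(x_\alpha - \mu)^2/2\sigma^4 = 0$$

$$\frac{\partial^2 L}{\partial(\sigma^2)^2} = N/2\sigma^4 - \sum(x_\alpha - \mu)^2/\sigma^6 = -N/2\sigma^4$$

The maximum likelihood estimate of $\sigma^2$ is $(1/N)\sum(x_\alpha - \mu)^2$, which depends on $\mu$. If, however, we maximize $L$ simultaneously for $\sigma^2$ and $\mu$ by solving together the equations

$$\frac{\partial L}{\partial\mu} = 0, \qquad \frac{\partial L}{\partial\sigma^2} = 0$$

we obtain

$$\hat{\mu} = \sum x_\alpha/N = \bar{x} \quad \text{and} \quad \hat{\sigma}^2 = \sum(x_\alpha - \bar{x})^2/N = s^2$$

so that the sample mean and sample variance are simultaneous maximum likelihood estimates of $\mu$ and $\sigma^2$ respectively.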
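The simultaneous estimates of this example are easy to compute directly. The sketch below assumes a small made-up data set; `normal_mle` and `log_likelihood` are illustrative names, and the likelihood is evaluated only up to the additive constant $C$.

```python
import math

def normal_mle(xs):
    """Simultaneous maximum likelihood estimates (mean, variance with divisor N)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def log_likelihood(xs, mu, var):
    """L = -(N/2) log(var) - sum((x - mu)^2) / (2 var), omitting the constant C."""
    n = len(xs)
    return -0.5 * n * math.log(var) - sum((x - mu) ** 2 for x in xs) / (2.0 * var)

sample = [2.0, 4.0, 4.0, 6.0]          # hypothetical observations
mu_hat, var_hat = normal_mle(sample)
print(mu_hat, var_hat)
# Moving either parameter away from the estimate lowers L.
print(log_likelihood(sample, mu_hat, var_hat) > log_likelihood(sample, mu_hat + 0.1, var_hat))
```

Note that the divisor is $N$, not $N-1$, so the variance estimate is the biased one discussed in Example 3.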


The importance of the maximum likelihood method depends on the following properties. If a most-efficient statistic exists, the method will give it. If a sufficient estimate $T_0$ exists, any solution of (12.14) will be a function of $T_0$.

For a sufficient statistic, $h$ is independent of $\theta$, so that from (12.6)

$$\frac{\partial L}{\partial\theta} = \frac{\partial \log g(T_0, \theta)}{\partial\theta}$$

If this is zero, the solution for $\theta$ is a function of $T_0$ alone. If the statistic is also most-efficient,

$$\partial(\log g)/\partial\theta = k(T_0 - \theta)$$

so that the equation $\partial L/\partial\theta = 0$ has the unique solution $T = T_0$.

Another important property of maximum likelihood statistics is that, under certain conditions, they tend to normality for large $N$ and have in the limit minimum variance. In other words they are asymptotically normal and asymptotically most-efficient.

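Conditions strong enough to ensure the approach to normality are: (1) if $f$ is the frequency function $f(x, \theta)$, then the first three derivatives of $f$ with respect to $\theta$ exist, for all $\theta$ in some interval containing the true value $\theta_0$; (2) these derivatives are integrable over all $x$; and (3)

$$k^2 = \int \left(\frac{\partial \log f}{\partial\theta}\right)^2 f\,dx$$

is finite and positive for all $\theta$ in this interval.

For any $\theta$ in this interval, we have by Taylor's theorem

$$\frac{\partial \log f}{\partial\theta} = \left(\frac{\partial \log f}{\partial\theta}\right)_0 + (\theta - \theta_0)\left(\frac{\partial^2 \log f}{\partial\theta^2}\right)_0 + \tfrac{1}{2}(\theta - \theta_0)^2 R(x)$$

where we assume that $|R(x)|$ has an upper bound independent of $\theta$, and where the subscript 0 means that $\theta$ is to be put equal to $\theta_0$.

Since $L = \sum \log f(x_\alpha, \theta)$, we therefore obtain

$$\frac{1}{N}\frac{\partial L}{\partial\theta} = B_0 + B_1(\theta - \theta_0) + \tfrac{1}{2}B_2(\theta - \theta_0)^2 = 0 \tag{12.15}$$

where

$$B_0 = \frac{1}{N}\sum\left(\frac{\partial \log f}{\partial\theta}\right)_0, \qquad B_1 = \frac{1}{N}\sum\left(\frac{\partial^2 \log f}{\partial\theta^2}\right)_0$$

and

$$B_2 = \frac{1}{N}\sum R(x_\alpha)$$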


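Now since $\int f\,dx = 1$, $f$ being a frequency function, it follows that

$$\int \frac{\partial f}{\partial\theta}\,dx = 0, \qquad \int \frac{\partial^2 f}{\partial\theta^2}\,dx = 0 \tag{12.16}$$

for every $\theta$ in the interval. Consequently

$$E\left(\frac{\partial \log f}{\partial\theta}\right)_0 = \int \left(\frac{1}{f}\frac{\partial f}{\partial\theta}\right)_0 f(x, \theta_0)\,dx = 0 \tag{12.17}$$

and

$$E\left(\frac{\partial^2 \log f}{\partial\theta^2}\right)_0 = \int \left[\frac{1}{f}\frac{\partial^2 f}{\partial\theta^2} - \left(\frac{1}{f}\frac{\partial f}{\partial\theta}\right)^2\right]_0 f(x, \theta_0)\,dx = -E\left[\left(\frac{\partial \log f}{\partial\theta}\right)^2\right]_0 = -k^2 \tag{12.18}$$

by the condition (3) mentioned above.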

Hence $B_0$ is the arithmetic mean of $N$ independent random variables with the same distribution and with expectation zero and variance $k^2$. By the central limit theorem it follows that $B_0$ converges in probability to zero and that $\sqrt{N}B_0$ is asymptotically normal with variance $k^2$. In the same way, $B_1$ converges to $-k^2$. It may further be proved that for sufficiently large $N$ the equation (12.15) has, with probability arbitrarily near to 1, a root arbitrarily near to $\theta_0$.

Moreover, if $T$ is this root, we can write equation (12.15) as

$$T - \theta_0 = \frac{B_0}{-B_1 - \tfrac{1}{2}B_2(T - \theta_0)} \tag{12.19}$$

Since $B_2$ remains finite as $N$ increases, the denominator in (12.19) tends to $k^2$ as $N$ increases. Writing (12.19) in the equivalent form

$$k\sqrt{N}(T - \theta_0) = \frac{\sqrt{N}B_0/k}{\{-B_1 - \tfrac{1}{2}B_2(T - \theta_0)\}/k^2}$$

we see that the denominator tends to 1 and that the numerator is asymptotically normal with mean 0 and variance 1.

Hence the maximum likelihood estimate $T$ is, in the limit for large $N$, normally distributed about the true value $\theta_0$ with variance given by

$$\frac{1}{\sigma_T^2} = Nk^2 = -N\,E\left(\frac{\partial^2 \log f}{\partial\theta^2}\right)_0 \tag{12.20}$$

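This is a very useful way of finding the standard error of a maximum likelihood estimate. The quantity $1/\sigma_T^2$ is regarded by Fisher as a measure of the information relevant to $\theta$ provided by the sample.

If $T_1$ and $T_2$ are simultaneous maximum likelihood estimates of two parameters, $\theta_1$ and $\theta_2$, and if these estimates have a bivariate normal distribution with variances $\sigma_1^2$ and $\sigma_2^2$ and covariance $\rho\sigma_1\sigma_2$, then it may be shown that $\sigma_1^2$ and $\sigma_2^2$ are the cofactors of the terms in the principal diagonal of the determinant

$$D = \begin{vmatrix} -E\left(\dfrac{\partial^2 L}{\partial\theta_1^2}\right) & -E\left(\dfrac{\partial^2 L}{\partial\theta_1\,\partial\theta_2}\right) \\[2ex] -E\left(\dfrac{\partial^2 L}{\partial\theta_1\,\partial\theta_2}\right) & -E\left(\dfrac{\partial^2 L}{\partial\theta_2^2}\right) \end{vmatrix} \tag{12.21}$$

divided by the determinant $D$ itself. The derivatives are evaluated at the true values of $\theta_1$ and $\theta_2$. This theorem may be extended to more than two parameters.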


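Example 5. For a parent population with a probability $\theta$ of "success" in a single trial, the probability of $r$ successes in $N$ trials is $\binom{N}{r}\theta^r(1-\theta)^{N-r}$. The likelihood function is, therefore,

$$L = \log\binom{N}{r} + r\log\theta + (N-r)\log(1-\theta) \tag{12.22}$$

The maximum likelihood estimate of the parameter $\theta$ is given by

$$\frac{\partial L}{\partial\theta} = \frac{r}{\theta} - \frac{N-r}{1-\theta} = 0$$

whence the estimate is found to be

$$\hat{\theta} = r/N$$

Also,

$$\frac{\partial^2 L}{\partial\theta^2} = -\frac{r}{\theta^2} - \frac{N-r}{(1-\theta)^2}$$

Now the expected value of $r$ is $N\theta$, so that

$$-E\left(\frac{\partial^2 L}{\partial\theta^2}\right) = \frac{N}{\theta(1-\theta)} = \frac{1}{\sigma_{\hat\theta}^2}$$

The variance of $\hat\theta$ is therefore $\theta(1-\theta)/N$, which is the well-known formula. The standard error is given by substituting for $\theta$ its estimate $r/N$. Each trial contributes an amount of information measured by $[\theta(1-\theta)]^{-1}$.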
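As a brief illustration of this example, the estimate, the information, and the standard error can be computed together. The function name `binomial_mle` and the counts used below are assumptions for the sake of the sketch.

```python
import math

def binomial_mle(r, n):
    """Return (theta_hat, information, standard_error) for r successes in n trials.

    Assumes 0 < r < n so that theta_hat*(1 - theta_hat) is non-zero.
    """
    theta = r / n
    information = n / (theta * (1.0 - theta))   # -E(d2L/dtheta2) evaluated at theta_hat
    std_error = math.sqrt(1.0 / information)    # sqrt(theta_hat*(1 - theta_hat)/n)
    return theta, information, std_error

theta_hat, info, se = binomial_mle(30, 100)     # hypothetical data
print(theta_hat, info, se)
```

Each of the `n` trials contributes `1/(theta*(1 - theta))` to `information`, in line with the closing remark of the example.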
12.5 Curve-fitting by the Method of Maximum Likelihood. It was pointed out in Chapter V that the method of moments, as frequently used for curve-fitting, is not the most efficient method possible, except in certain special cases. R. A. Fisher has shown that the maximum likelihood estimates of the parameters in various Pearson types of distribution curves have variances which are generally less, and often much less, than those of the estimates furnished by the method of moments. Thus, if we know that the frequency curve is that of a Gamma variate (Type III) with a single parameter $\lambda > 0$ to be estimated, we have

$$f(x) = \frac{x^{\lambda-1}e^{-x}}{\Gamma(\lambda)}, \qquad \lambda > 0$$

and

$$L = -N\log\Gamma(\lambda) + (\lambda - 1)\sum\log x_\alpha - \sum x_\alpha$$

Then

$$\frac{\partial L}{\partial\lambda} = -N\frac{d}{d\lambda}\log\Gamma(\lambda) + \sum\log x_\alpha = 0 \tag{12.23}$$

and the estimate of $\lambda$ is the unique positive root $\hat\lambda$ of (12.23). This root may be found from a table of the Digamma function. The variance is found from

$$\frac{1}{\sigma_{\hat\lambda}^2} = N\frac{d^2}{d\lambda^2}\log\Gamma(\lambda) \tag{12.24}$$

by the help of tables of the Trigamma function.

By the method of moments, we estimate $\lambda$ from the mean $\bar{x}$ of the distribution. Since $E(x) = \lambda$ and $\operatorname{Var}(x) = \lambda$, this estimate is $\bar{x}$ itself, with variance $\lambda/N$. The efficiency of the moment estimate is, therefore,

$$\left[\lambda\frac{d^2}{d\lambda^2}\log\Gamma(\lambda)\right]^{-1}$$

and this is always less than 1, tending to zero as $\lambda \to 0$. Since the skewness is $2/\sqrt{\lambda}$, the more skew the curve the less the accuracy of estimation of $\lambda$ by the method of moments. For a skewness of 1, the efficiency is about 88%.
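In place of printed tables, the Trigamma function can be evaluated numerically. The sketch below is one possible approach, not the method of the text: it uses the recurrence $\psi'(x) = \psi'(x+1) + 1/x^2$ followed by a standard asymptotic series, and the function names are illustrative.

```python
def trigamma(x):
    """Second derivative of log Gamma(x) for x > 0 (recurrence plus asymptotic series)."""
    total = 0.0
    # Shift the argument upward until the asymptotic series is accurate.
    while x < 8.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    total += 1.0 / x + inv2 / 2.0 + (inv2 / x) * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return total

def moment_efficiency(lam):
    """Efficiency of the moment estimate of lambda: 1 / (lambda * trigamma(lambda))."""
    return 1.0 / (lam * trigamma(lam))

def lam_from_skewness(gamma1):
    """Invert skewness = 2/sqrt(lambda)."""
    return 4.0 / gamma1 ** 2

print(moment_efficiency(lam_from_skewness(1.0)))   # about 0.88 for a skewness of 1
print(moment_efficiency(0.01))                     # small: efficiency falls toward 0 with lambda
```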

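In practice, however, we usually require to determine three parameters of a Type III curve in order to fit an empirical distribution such as that of Example 3 in § 5.14. If the equation of the curve is

$$f = K(t + \lambda^{1/2})^{\lambda-1}e^{-\lambda^{1/2}t}, \qquad \lambda > 2 \tag{12.25}$$

where $t = (x-\mu)/\sigma$, $K = \lambda^{\lambda/2}e^{-\lambda}/[\sigma\Gamma(\lambda)]$, the parameters $\mu$, $\sigma^2$, and $\gamma_1$ are, by the method of moments, estimated by the statistics $k_1$, $k_2$, and $g_1$ respectively. The variance of $k_1$ is $\sigma^2/N$, and that of $k_2$ is

$$\frac{\kappa_4}{N} + \frac{2\kappa_2^2}{N-1} = 2\sigma^4\left(\frac{3}{\lambda N} + \frac{1}{N-1}\right)$$

The variance of $g_1$ is difficult to evaluate exactly, but it can be approximated for large $N$.

The likelihood function is

$$L = N\left[\tfrac{1}{2}\lambda\log\lambda - \lambda - \log\Gamma(\lambda) - \log\sigma\right] + (\lambda - 1)\sum\log\left\{\frac{x_\alpha - \mu}{\sigma} + \lambda^{1/2}\right\} - \lambda^{1/2}\sum\frac{x_\alpha - \mu}{\sigma} \tag{12.26}$$

and the estimation of $\mu$, $\sigma$, and $\lambda$ would require the simultaneous solution of the equations $\partial L/\partial\mu = 0$, $\partial L/\partial\sigma = 0$, $\partial L/\partial\lambda = 0$. This is clearly a formidable task. We can, however, estimate these parameters one at a time, assuming the others fixed. Thus for the estimation of $\mu$ we have

$$\frac{\partial L}{\partial\mu} = -\frac{\lambda - 1}{\sigma}\sum\left[\frac{x_\alpha - \mu}{\sigma} + \lambda^{1/2}\right]^{-1} + \frac{N\lambda^{1/2}}{\sigma}$$

and

$$\frac{\partial^2 L}{\partial\mu^2} = -\frac{\lambda - 1}{\sigma^2}\sum\left[\frac{x_\alpha - \mu}{\sigma} + \lambda^{1/2}\right]^{-2}$$

The expectation of $\partial^2 L/\partial\mu^2$ is

$$-\frac{\lambda}{\lambda - 2}\,\frac{N}{\sigma^2}$$

so that the variance of the estimate of $\mu$ is

$$\frac{\lambda - 2}{\lambda}\,\frac{\sigma^2}{N}$$

The efficiency of the moment estimate is, therefore, $(\lambda - 2)/\lambda = 1 - \tfrac{1}{2}\gamma_1^2$. This is practically equal to 1 for very small skewness but for $\gamma_1 = 1$ is only 50%.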

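12.6 The Chi-square Test of Goodness of Fit. If $Np_i$ is the expected number of observations in the $i$th class, and if $f_i$ is the observed frequency in that class, we have seen in § 5.12 that the quantity

$$\chi_s^2 = \sum_{i=1}^{k}(f_i - Np_i)^2/Np_i \tag{12.27}$$

is in the limit as $N \to \infty$ distributed as $\chi^2$ with $k - 1$ degrees of freedom.

If the quantities $p_i$ depend upon some parameter $\theta$, $\theta$ may be estimated by maximizing the likelihood. By equation (5.75) the probability of the observed sample is

$$P = \frac{N!}{\prod f_i!}\prod p_i^{f_i} \tag{12.28}$$

so that

$$L = C + \sum f_i\log p_i \tag{12.29}$$

The equation for estimating $\theta$ is, therefore,

$$\frac{\partial L}{\partial\theta} = \sum\frac{f_i}{p_i}\frac{\partial p_i}{\partial\theta} = 0 \tag{12.30}$$

and the variance in random samples of this estimate of $\theta$ is given by

$$\frac{1}{\sigma_\theta^2} = -E\left(\frac{\partial^2 L}{\partial\theta^2}\right) = -N\sum\frac{\partial^2 p_i}{\partial\theta^2} + N\sum\frac{1}{p_i}\left(\frac{\partial p_i}{\partial\theta}\right)^2 \tag{12.31}$$

since $E(f_i) = Np_i$.

Remembering that $\sum p_i = 1$, we see that the first term in (12.31) vanishes, so that

$$\frac{1}{\sigma_\theta^2} = N\sum\frac{1}{p_i}\left(\frac{\partial p_i}{\partial\theta}\right)^2 \tag{12.32}$$

Now from (12.27)

$$\chi_s^2 = \sum\frac{f_i^2}{Np_i} - N$$

so that if $\theta$ is estimated by making $\chi_s^2$ a minimum, the condition is

$$\sum\frac{f_i^2}{Np_i^2}\frac{\partial p_i}{\partial\theta} = 0 \tag{12.33}$$

which is not quite the same as (12.30) since $f_i$ in general differs from $Np_i$. However, as $N \to \infty$, the ratio $f_i/Np_i \to 1$, for every value of $i$, so that in large samples the method of maximum likelihood is practically equivalent to minimizing $\chi_s^2$. It should be remembered that it is only for very large samples that $\chi_s^2$ has exactly the $\chi^2$ distribution.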

Now, in curve fitting, the true values of $p_i$ have usually to be estimated from the sample. If the method of estimation is most-efficient, the value of $\chi_s^2$ will be a minimum or very near it, but if the $p_i$ are estimated by an inefficient method the calculated $\chi_s^2$ will be too large, depending not only on the deviations of the observations from hypothesis but also on the errors of estimation of the parameters.

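Let $m_i = Np_i$, the true expected frequency in the $i$th class, and $m_i'$ the frequency calculated from a most-efficient estimate $T$ of a parameter of which the true value is $\theta$. Writing $\delta T = \theta - T$, we have

$$\frac{1}{m_i} = \frac{1}{m_i'} - \frac{\delta T}{m_i'^2}\frac{\partial m_i'}{\partial T} + (\delta T)^2\left[\frac{1}{m_i'^3}\left(\frac{\partial m_i'}{\partial T}\right)^2 - \frac{1}{2m_i'^2}\frac{\partial^2 m_i'}{\partial T^2}\right] + \cdots$$

If $\chi_s'^2$ is the calculated value of $\chi_s^2$,

$$\chi_s^2 = \sum f_i^2/m_i - N, \qquad \chi_s'^2 = \sum f_i^2/m_i' - N$$

Since $\chi_s'^2$ is a minimum, $\sum\dfrac{f_i^2}{m_i'^2}\dfrac{\partial m_i'}{\partial T} = 0$, so that

$$\chi_s^2 = \chi_s'^2 + (\delta T)^2\sum f_i^2\left[\frac{1}{m_i'^3}\left(\frac{\partial m_i'}{\partial T}\right)^2 - \frac{1}{2m_i'^2}\frac{\partial^2 m_i'}{\partial T^2}\right] + \cdots$$

As the size of the sample increases, $T$ approaches $\theta$, so that $\delta T \to 0$, and also $f_i/m_i \to 1$. For large $N$, therefore, we have approximately

$$\chi_s^2 - \chi_s'^2 = (\delta T)^2\sum\frac{1}{m_i'}\left(\frac{\partial m_i'}{\partial T}\right)^2 \tag{12.34}$$

since $\sum m_i' = N$, which is independent of $T$.

Now since $m_i'$ is calculated from a most-efficient statistic $T$, the variance of this statistic is given by

$$\frac{1}{\sigma^2} = \sum\frac{1}{m_i'}\left(\frac{\partial m_i'}{\partial T}\right)^2$$

whence

$$\chi_s^2 - \chi_s'^2 = (\delta T/\sigma)^2 \tag{12.35}$$

Since $E(T - \theta) = 0$ and $\operatorname{Var}(T - \theta) = \sigma^2$, $\delta T/\sigma$ is a standard normal variate. It is, moreover, independent of $\chi_s'^2$, which is a minimum value for variations in $T$. It was shown in § 5.12 that $\chi_s^2$ is distributed as a sum of squares of $k - 1$ independent standard normal variates. Hence we see from (12.35) that $\chi_s'^2$ is distributed as a sum of squares of $k - 2$ independent standard normal variates. That is, $\chi_s'^2$ has the $\chi^2$ distribution with $k - 2$ degrees of freedom.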

This, however, is true only when we use a most-efficient statistic. If $\chi_s''^2$ is the value calculated by using a statistic $T'$ of efficiency $e$ and therefore of variance $\sigma^2/e$, we have, by the same argument as above,

$$\chi_s''^2 - \chi_s'^2 = (\delta T'/\sigma)^2 \tag{12.36}$$

where $\delta T' = T' - T$. The variance of $\delta T'$ is given by

$$\operatorname{Var}(\delta T') = \operatorname{Var}(T') + \operatorname{Var}(T) - 2\operatorname{Cov}(T, T') = \sigma^2(1/e + 1) - 2\rho\sigma^2/\sqrt{e}$$

Now the coefficient of correlation $\rho$ between a statistic of efficiency $e$ and a most-efficient statistic for estimating the same parameter is $\sqrt{e}$, so that

$$\operatorname{Var}(\delta T') = \sigma^2(1/e - 1) \tag{12.37}$$

The mean value of $\chi_s''^2 - \chi_s'^2$ is, therefore, $1/e - 1$, and since the mean value of $\chi_s'^2$ is $k - 2$, that of $\chi_s''^2$ is $k - 3 + 1/e$. The distribution, however, is not that of $\chi^2$, so that, when $e$ is low, the mean value of the usual chi-square is altered and the tables of goodness of fit no longer apply.

12.7 The Correction of an Inefficient Estimate. Since inefficient estimates are often much more easily obtained than efficient ones (for example, by the method of moments) and since, if the distribution is not very far from normal, the loss of efficiency may not be serious, it is sometimes worth while to apply a small correction to an inefficient estimate rather than go through the labor of finding the most efficient one.

If $T$ is a most-efficient estimate and $T'$ an estimate of efficiency $e$, then, as we have seen in (12.37), the variance of $T' - T$ is $\sigma_T^2(1/e - 1)$. If $L$ is the likelihood, $(\partial L/\partial\theta)_T = 0$, and also, for large $N$, $(\partial^2 L/\partial\theta^2)_T = -1/\sigma_T^2$ approximately. Now, by Taylor's theorem,

$$\left(\frac{\partial L}{\partial\theta}\right)_{T'} = \left(\frac{\partial L}{\partial\theta}\right)_T + (T' - T)\left(\frac{\partial^2 L}{\partial\theta^2}\right)_T = -\frac{T' - T}{\sigma_T^2}$$

if squares and higher powers of $T' - T$ may be neglected. Since for values of $e$ near 1 the variance of $T' - T$ is a small fraction of $\sigma_T^2$, this approximation is justified if the efficiency of $T'$ is fairly high. We have, therefore, as an approximation

$$T = T' + \sigma_T^2\left(\frac{\partial L}{\partial\theta}\right)_{T'} \tag{12.38}$$

which enables us to correct $T'$ to bring it nearer to $T$.

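Example 6. In § 12.5 we saw that the moment estimate of the parameter $\mu$ of a Type III distribution with given $\sigma$ and $\lambda$ is $T' = \bar{x}$, while the maximum likelihood estimate $T$ is given by solving the equation

$$\sum\left[\frac{x_\alpha - T}{\sigma} + \lambda^{1/2}\right]^{-1} = N\lambda^{1/2}(\lambda - 1)^{-1}$$

which is of degree $N$. The variance of $T$ is $(\lambda - 2)\sigma^2/\lambda N$. Hence, from (12.26) and (12.38) we have approximately

$$T = \bar{x} - \frac{(\lambda - 2)\sigma}{\lambda}\left\{(\lambda - 1)\frac{1}{N}\sum\left[\frac{x_\alpha - \bar{x}}{\sigma} + \lambda^{1/2}\right]^{-1} - \lambda^{1/2}\right\} \tag{12.39}$$

From the data of Example 3 in § 5.14, assuming that the parent population is of Type III with $\sigma = 138$ milliseconds and $\gamma_1 = 0.4$ ($\lambda = 25$), we have $T' = 204$ milliseconds. The value of $\dfrac{1}{N}\sum\left[\dfrac{x_\alpha - \bar{x}}{\sigma} + \lambda^{1/2}\right]^{-1}$ for the grouped distribution turns out to be 0.2085, whence $T = 203.5$ milliseconds. The correction in this case is quite small.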

Example 7. Ability to taste phenyl-thio-carbamide is known to be inherited as a Mendelian dominant character. If $p$ is the probability of possessing the gene $T$ and $q$ that of possessing $t$, the only non-tasters are those with two $t$ genes, while individuals with $TT$ or $Tt$ are tasters. In a random collection, therefore, the probability that any given individual is a non-taster is $q^2$.

The value of $q$ was estimated for chimpanzees from a collection of 28 animals. Of these, 23 were unrelated and included 5 non-tasters. The probability of this sample is $C(q^2)^5(1 - q^2)^{18} = Cq^{10}(1 - q^2)^{18}$. A group of 3 animals (2 parents and a child) were all non-tasters. Since if both parents possess $tt$ genes the child is necessarily a non-taster, the probability of this group is $q^4$. Another group of 2 (mother and child) were both tasters. It can be shown by working out the various possible cases that the probability of this combination is $(1 - q)(1 + q - q^2)$. For the whole sample the proportion of non-tasters was 2/7, which gives as a rough estimate of $q$ the value $(2/7)^{1/2} = 0.5345$.

The probability of the combined sample is $P = Cq^{10}(1 - q^2)^{18}q^4(1 - q)(1 + q - q^2)$, so that $L = C_1 + 10\log q + 18\log(1 - q^2) + 4\log q + \log(1 - q) + \log(1 + q - q^2)$. Hence

$$\frac{\partial L}{\partial q} = \frac{14}{q} - \frac{36q}{1 - q^2} - \frac{1}{1 - q} - \frac{2q - 1}{1 + q - q^2}$$

The maximum likelihood equation for $q$ is thus a fourth degree algebraic equation,

$$14 + 14q - 68q^2 - 51q^3 + 53q^4 = 0$$

This can be solved by systematic trial and error (bracketing) and gives the estimate $q = 0.5141$.

In order to use the approximation (12.38) we need to evaluate $\partial L/\partial q$ for $q = 0.5345$ (which gives $-2.95$) and to estimate $\sigma_q^2$. The latter can be found from $\partial^2 L/\partial q^2$ and is 0.00685. The corrected rough estimate of $q$ is, therefore, $0.5345 - 0.00685 \times 2.95 = 0.5143$, extremely close to the true maximum likelihood value.
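The figures of this example can be reproduced with a short sketch. It assumes the variance is taken as $-1/(\partial^2 L/\partial q^2)$ evaluated at the rough estimate, so that one application of (12.38) coincides with a single Newton step; repeating the step is used here in place of the bracketing mentioned in the text. All function names are illustrative.

```python
def score(q):
    """dL/dq for the combined chimpanzee sample."""
    return 14.0 / q - 36.0 * q / (1.0 - q * q) - 1.0 / (1.0 - q) + (1.0 - 2.0 * q) / (1.0 + q - q * q)

def second_derivative(q):
    """d2L/dq2, obtained by differentiating score(q) term by term."""
    d = 1.0 + q - q * q
    return (-14.0 / q ** 2
            - 36.0 * (1.0 + q * q) / (1.0 - q * q) ** 2
            - 1.0 / (1.0 - q) ** 2
            - (2.0 * d + (1.0 - 2.0 * q) ** 2) / d ** 2)

def one_step_correction(q_rough):
    """Equation (12.38): T = T' + var * (dL/dq at T'), with var = -1/(d2L/dq2)."""
    variance = -1.0 / second_derivative(q_rough)
    return q_rough + variance * score(q_rough)

def newton_mle(q, tol=1e-12, max_iter=50):
    """Repeat the correction until it no longer changes q."""
    for _ in range(max_iter):
        step = score(q) / second_derivative(q)
        q -= step
        if abs(step) < tol:
            break
    return q

def quartic(q):
    """Left-hand side of the fourth-degree likelihood equation."""
    return 14.0 + 14.0 * q - 68.0 * q ** 2 - 51.0 * q ** 3 + 53.0 * q ** 4

q_rough = (2.0 / 7.0) ** 0.5
print(q_rough, score(q_rough), -1.0 / second_derivative(q_rough))
print(one_step_correction(q_rough), newton_mle(q_rough))
```

A single correction already lands within a few units in the fourth decimal place of the fully iterated root.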

12.8 The Neyman-Pearson Theory of Confidence Intervals. Several examples of estimation by confidence intervals have been given in earlier chapters. In general terms the problem is to find two functions of the sample $S$, say $\underline{\theta}(S)$ and $\bar{\theta}(S)$, such that

$$\Pr\{\underline{\theta}(S) < \theta_0 < \bar{\theta}(S) \mid \theta_0\} = 1 - \alpha \tag{12.40}$$

where $\alpha$ is a fixed number between 0 and 1, say 0.05. That is, the probability that the interval between $\underline{\theta}(S)$ and $\bar{\theta}(S)$ includes, or covers, the true value $\theta_0$ is to be 0.95, whatever this true value may be.

Let $T$ be a statistic used to estimate $\theta$, and let $g(T, \theta)$ be the frequency function of $T$. For any value of $\theta$ we can find two quantities $\gamma_1$ and $\gamma_2$ (depending on $\theta$ and $\alpha$) such that

$$\Pr\{\gamma_1 < T < \gamma_2 \mid \theta\} = 1 - \alpha$$

This can be done in infinitely many ways, since all that is necessary is that the sum of the areas of the two tails, one from $-\infty$ to $\gamma_1$ and the other from $\gamma_2$ to $\infty$, shall be equal to $\alpha$. As $\theta$ varies, the points $(\gamma_1, \theta)$ and $(\gamma_2, \theta)$ describe curves $C_1$ and $C_2$ in the plane of $T$ and $\theta$, and in most cases each of these curves will be cut by a straight line parallel to the axis of $\theta$ in one point only. (See Figure 16, Chapter VI.) If the ordinates of the points on $C_1$ and $C_2$ where the curves are cut by a vertical line $T$ = constant are $\underline{\theta}(T)$ and $\bar{\theta}(T)$, where these functions depend, of course, on $\alpha$, then the region between the curves $C_1$ and $C_2$ is characterized equally well by $\gamma_1 < T < \gamma_2$ for all $\theta$ or by $\underline{\theta} < \theta < \bar{\theta}$ for all $T$. Hence it follows that

$$\Pr\{\underline{\theta} < \theta < \bar{\theta} \mid \theta\} = 1 - \alpha$$

where $\underline{\theta}$ and $\bar{\theta}$ are functions of $T$ and therefore of the sample $S$, since $T$ is a statistic calculated from $S$. The probability is interpreted in the sense explained in § 6.4.

The quantities $\underline{\theta}(T)$ and $\bar{\theta}(T)$ are lower and upper confidence limits respectively, corresponding to the confidence coefficient $1 - \alpha$, or, to phrase it differently, at the level $\alpha$. The risk of accepting the hypothesis that the confidence interval includes the true value when in fact it does not do so is $\alpha$.

There are several arbitrary elements about this procedure: 

1. The choice of the statistic $T$. The same parameter may be estimated by more than one statistic. The population mean, for example, may be estimated by the sample mean or the sample median.

2. The choice of the confidence coefficient. This depends on the risk of error one is prepared to take, and will vary according to the seriousness of the consequences that may follow from making an error of this kind.

3. The division of the risk between the upper and lower tails of the distribution. The risk that $\theta$ is below $\underline{\theta}(S)$ need not be equal (although it often is so) to the risk that $\theta$ is above $\bar{\theta}(S)$. In some cases we may be particularly desirous not to underestimate $\theta$, and in other cases not to overestimate it.

4. The sample size. The larger the sample the more accurately $\theta$ can be estimated from it, and hence the narrower the confidence interval for a given confidence coefficient.

It is clearly desirable, as a general rule, to have a confidence interval as short as possible. The confidence interval $\delta$ from $\underline{\theta}$ to $\bar{\theta}$ is said to be a shortest confidence interval, corresponding to the coefficient $\alpha$, if the condition (12.40) is satisfied and if, $\delta'$ being any other confidence interval from $\underline{\theta}'$ to $\bar{\theta}'$ also satisfying (12.40), $\delta$ is less likely than $\delta'$ to cover a false value $\theta_1$ when the true value is $\theta_0$. Symbolically,

$$\Pr\{\underline{\theta} < \theta_1 < \bar{\theta} \mid \theta_0\} \le \Pr\{\underline{\theta}' < \theta_1 < \bar{\theta}' \mid \theta_0\} \tag{12.41}$$

for all values of $\theta_1$ and whatever the true value $\theta_0$ may be. Unfortunately it is quite exceptional to be able to find such a shortest confidence interval. It may be remarked that the word "shortest" as used here does not necessarily mean shortest in the sense of having minimum length, and Kendall has proposed to use instead the term "most selective."

A confidence interval is unbiased if the probability that it covers $\theta_0$ when the true value is $\theta_0$ is $1 - \alpha$ but the probability that it covers $\theta_1$ when the true value is $\theta_0$ is always equal to or less than $1 - \alpha$, whatever the values of $\theta_1$ and $\theta_0$. The probability of covering a false value $\theta_1$ is, therefore, never greater than the probability of covering the true value, no matter on which side of the true value $\theta_1$ may lie. This is the sense in which the interval is "unbiased." If condition (12.41) is satisfied when both the confidence intervals mentioned are unbiased, then $\delta$ is a shortest unbiased confidence interval. Sometimes we can find shortest one-sided confidence intervals, for which $\Pr\{\delta \text{ covers } \theta_1 \mid \theta_0\} \le \Pr\{\delta' \text{ covers } \theta_1 \mid \theta_0\}$ for all $\theta_1$ on one side only of $\theta_0$, that is, for which $\theta_1 - \theta_0$ is always positive or always negative.

12.9 A Geometrical Illustration of Confidence Intervals. It is often convenient to think of a set of sample values $x_1, x_2, \cdots, x_N$ as determining a point $E$ in an $N$-dimensional Euclidean space, the sample space $W$. The frequency function of $x$ determines a density associated with each point.

We suppose that the frequency function depends on several parameters $\theta, \theta', \theta'', \cdots$, of which we desire to estimate $\theta$. We add a new dimension to the sample space, providing an axis of $\theta$, as shown in Figure 40 for the case $N = 2$. For each point $E$ the quantities $\bar{\theta}(E)$ and $\underline{\theta}(E)$ determine points $U$ and $L$ on a line through $E$ parallel to the axis of $\theta$.

For any given value of $\theta$, say $\theta_1$, the space $W$ in the diagram is a plane. Let $E$ be the point corresponding to a particular set of sample values, and let the line through $E$ parallel to the axis of $\theta$ cut the space $W$ in $E_1$. Then all the points $E_1$ which are such that the points $L$ and $U$ lie on opposite sides of $E_1$ constitute the region of acceptance $A$, and the interval from $L$ to $U$ covers or includes $E_1$. If, on the other hand, $E_1$ is outside the region of acceptance, say at $E_1'$, the points $L$ and $U$ lie on the same side of $E_1$, and the interval does not include $\theta_1$. Hence, for any fixed $\theta_1$,

$$\Pr\{\underline{\theta} < \theta_1 < \bar{\theta}\} = \Pr\{E \in A(\theta_1) \mid \theta, \theta' \cdots\}$$

where the notation $E \in A(\theta_1)$ means that $E$ is an element of the region of acceptance corresponding to the value $\theta_1$ of the parameter $\theta$. (The points $E$ and $E_1$ are the same, as far as the sample space is concerned.) It follows that, if the confidence limits are determined so that $\Pr\{\underline{\theta} < \theta < \bar{\theta} \mid \theta, \theta' \cdots\} = 1 - \alpha$, then for all $\theta$,

$$\Pr\{E \in A(\theta) \mid \theta, \theta' \cdots\} = 1 - \alpha \tag{12.42}$$

The region $A(\theta)$ cannot, therefore, be empty for any permissible value of $\theta$.

If the functions $\underline{\theta}$ and $\bar{\theta}$ are single-valued and determined for all points $E$, there must be at least one region of acceptance for every $E$. Moreover, if a sample point falls in the regions $A(\theta_1)$ and $A(\theta_2)$ corresponding to two values $\theta_1$ and $\theta_2$ of $\theta$, it will fall in the region $A(\theta_3)$ corresponding to any $\theta_3$ between $\theta_1$ and $\theta_2$, and vice versa. These conditions are sufficient, as well as necessary, so that in order to find the confidence limits for any given sample we can take the upper and lower bounds of the interval of $\theta$ for which $E$ lies within the region of acceptance $A(\theta)$.

12.10 The Determination of Confidence Intervals. If we have a sufficient statistic $T$ for estimating $\theta$, the likelihood function is

$$L = L_1(T, \theta) + L_2(x_1, x_2, \cdots, x_N, \theta', \theta'' \cdots)$$

The equation $T$ = constant determines a set of surfaces in $W$, and the condition $T \le k$ determines a region $K$. The probability that $E$ falls in $K$ depends only on $T$ and $\theta$, and by appropriately choosing $k$ we can make this probability equal to $1 - \alpha$. Confidence intervals can then be set up as in § 12.9.

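Example 8. For a normal population with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{x}$ is a sufficient statistic for estimating $\mu$. The likelihood function can be written

$$L = C - N\log\sigma - \frac{N}{2\sigma^2}(\bar{x} - \mu)^2 - \frac{1}{2\sigma^2}\sum(x_\alpha - \bar{x})^2$$

which shows the sufficiency. It is known that $\bar{x}$ is normally distributed with mean $\mu$ and variance $\sigma^2/N$, so that if the region $K$ is determined by $|\bar{x} - \mu| \le \tau_\alpha\sigma/\sqrt{N}$, where $\tau_\alpha$ is that value for which $\int_{\tau_\alpha}^{\infty}\phi(t)\,dt = \alpha/2$, then $\Pr\{E \in K\} = 1 - \alpha$.

The confidence limits are given by

$$\underline{\mu} = \bar{x} - \tau_\alpha\sigma/\sqrt{N}, \qquad \bar{\mu} = \bar{x} + \tau_\alpha\sigma/\sqrt{N}$$

We are here supposing that $\sigma$ is known, even though $\mu$ is unknown.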
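The limits of this example can be computed as follows. This is a minimal sketch using the standard-library `statistics.NormalDist` for the normal quantile; the function name and the numerical inputs are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def mean_confidence_limits(xbar, sigma, n, alpha=0.05):
    """Limits xbar -/+ tau_alpha * sigma / sqrt(N) for a normal mean with known sigma."""
    # tau_alpha cuts off an upper tail of area alpha/2 under the standard normal curve.
    tau = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = tau * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lower, upper = mean_confidence_limits(xbar=10.0, sigma=2.0, n=16, alpha=0.05)
print(lower, upper)
```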


If the standard deviation or other scale parameter is unknown, we can sometimes use the method of "studentization." Suppose that in Example 8 above, both $\mu$ and $\sigma$ are unknown. Since $\bar{x}$ is normally distributed, and since the sample variance $s^2$ is distributed as $\sigma^2\chi^2/N$ independently of $\bar{x}$, the quantity

$$t = \frac{\bar{x} - \mu}{s/n^{1/2}}$$

where $n = N - 1$, has the Student $t$-distribution, given by

$$f(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}$$

If $t_\alpha$ is such that $\int_{-\infty}^{-t_\alpha} f(t)\,dt = \int_{t_\alpha}^{\infty} f(t)\,dt = \alpha/2$, we have

$$\Pr\{-t_\alpha < t < t_\alpha\} = 1 - \alpha$$

which is equivalent to

$$\Pr\{\bar{x} - st_\alpha/n^{1/2} \le \mu \le \bar{x} + st_\alpha/n^{1/2}\} = 1 - \alpha$$

The confidence interval is now a function of the sample values only, once $\alpha$ is given.

This method depends upon the possibility of finding a function of the parameter to be estimated whose distribution is independent of any unknown parameters. Such a fortunate state of affairs does not occur very frequently.

Example 9 (Neyman). Suppose the variables $x_1$ and $x_2$ are independently and rectangularly distributed on the range from 0 to $\theta$. It is required to determine a confidence interval for $\theta$ from a single sample of the two variables, with confidence coefficient 3/4. The joint frequency function is

$$f(x_1, x_2) = 1/\theta^2, \qquad 0 < x_1 < \theta, \quad 0 < x_2 < \theta$$

and zero elsewhere.

If in Figure 41 (a), we regard the shaded region as a region of acceptance corresponding to the value $\theta$, and consider the set of all such regions for all values of $\theta$ (lying in parallel planes as in Figure 40), then it is easy to verify that the conditions governing regions of acceptance given at the end of § 12.9 are fulfilled. The point $E(x_1, x_2)$ lies in the region $A(\theta)$ if $\theta/2 < x_1 + x_2 < 3\theta/2$, and this condition may be expressed as $\tfrac{2}{3}(x_1 + x_2) < \theta < 2(x_1 + x_2)$. Hence the confidence limits are $\tfrac{2}{3}(x_1 + x_2)$ and $2(x_1 + x_2)$.

[Fig. 41, panels (a) and (b)]

Various other confidence limits corresponding to the same confidence coefficient may be assigned. Thus the shaded region in Figure 41 (b) is also a region of acceptance, which may be characterized by $\theta/2 < L < \theta$, where $L$ is the larger of the two values $x_1$ and $x_2$. Expressing this condition as $L < \theta < 2L$, we see that the confidence limits are $L$ and $2L$. Since $L$ is certainly less than $\tfrac{4}{3}(x_1 + x_2)$, the confidence interval in case (b) is shorter than that in case (a). It may be shown that case (b) actually gives a shortest interval in Neyman's sense.
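The coverage and length claims for the two intervals of Example 9 can be checked by simulation. The following is a rough sketch under assumed settings (true $\theta = 1$, a fixed seed, 20,000 trials); the function name is hypothetical and the results are Monte Carlo estimates, not exact values.

```python
import random

def simulate_intervals(theta=1.0, trials=20000, seed=1):
    """Estimate coverage and mean length of the case (a) and case (b) intervals."""
    rng = random.Random(seed)
    cover_a = cover_b = 0
    length_a = length_b = 0.0
    for _ in range(trials):
        x1 = rng.uniform(0.0, theta)
        x2 = rng.uniform(0.0, theta)
        total = x1 + x2
        larger = max(x1, x2)
        # Case (a): limits (2/3)(x1 + x2) and 2(x1 + x2).
        if 2.0 * total / 3.0 < theta < 2.0 * total:
            cover_a += 1
        length_a += 2.0 * total - 2.0 * total / 3.0
        # Case (b): limits L and 2L, where L is the larger observation.
        if larger < theta < 2.0 * larger:
            cover_b += 1
        length_b += larger
    return cover_a / trials, cover_b / trials, length_a / trials, length_b / trials

cov_a, cov_b, len_a, len_b = simulate_intervals()
print(cov_a, cov_b)   # both should be close to 3/4
print(len_a, len_b)   # case (b) should be noticeably shorter on average
```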

12.11 Confidence Intervals and Fiducial Inference. R. A. Fisher has 
developed in a series of papers * a theory of fiducial inference which is distinct 
in principle from the Neyman-Pearson theory, although in most practical 
problems the two theories give identical results. One problem in which they 
differ has been discussed in §.9.8, namely, the Behrens-Fisher problem. 

In the theory of fiducial inference we imagine a hypothetical distribution
of conceivable values of the parameter $\theta$ to be estimated. It is not, of course,
necessary to assume some actual a priori distribution of $\theta$, for then we should
be doing what Bayes did (§ 1.10). If $T$ is a statistic used to estimate $\theta$,
and if it is a sufficient statistic, the fiducial distribution of $\theta$, which is derived
from that of $T$, is unique. No problem of selecting the “best” or “shortest”
interval arises if sufficient statistics alone are used.

If $F(T, \theta)$ is the distribution function for $T$, the frequency function is
given by

(12.43)    $dF = \dfrac{\partial F(T, \theta)}{\partial T}\, dT$

Now $F(T_1, \theta)$ is the probability that for a fixed value of $\theta$ a random value of $T$
will not exceed $T_1$, and this is taken as the fiducial probability that for a fixed
value of $T_1$ a random value of $\theta$ will exceed $T_1$. Note that in Figure 16 of
§ 6.4, the locus of the lower bound of $\theta$ as $T$ varies is also the locus of the upper
bound of $T$ as $\theta$ varies. The fiducial distribution of $\theta$ is therefore given by

(12.44)    $dF = -\dfrac{\partial F(T, \theta)}{\partial \theta}\, d\theta = f(\theta)\, d\theta$

It is assumed that the variate $x$ is continuously distributed. The fiducial
distribution* of $\theta$ is not an a priori probability and is not to be interpreted in
a frequency sense. The idea is (the word “fiducia” means trust) that
$\int_{\theta_1}^{\theta_2} f(\theta)\, d\theta$ is a measure of our belief that $\theta$ lies between $\theta_1$ and $\theta_2$.

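Example 10. For a normal population with mean $\mu$ and variance $\sigma^2$, where $\sigma^2$ is known
and $\mu$ is to be estimated, the distribution of the sample mean $\bar{x}$ is given by

$dF = (N/2\pi\sigma^2)^{1/2} \exp[-N(\bar{x} - \mu)^2/2\sigma^2]\, d\bar{x}$

since $\bar{x}$ is normally distributed about $\mu$ with variance $\sigma^2/N$. Hence

$F(\bar{x}, \mu) = \int_{-\infty}^{\bar{x}} (N/2\pi)^{1/2} \sigma^{-1} \exp[-N(u - \mu)^2/2\sigma^2]\, du$

and

$-\dfrac{\partial F(\bar{x}, \mu)}{\partial \mu} = (N/2\pi)^{1/2} \sigma^{-1} \exp[-N(\bar{x} - \mu)^2/2\sigma^2]$

The fiducial distribution of $\mu$ is, therefore, a normal distribution with mean $\bar{x}$ and variance
$\sigma^2/N$. Consequently the fiducial limits are precisely the same as the confidence limits of
Example 8, with known $\sigma$.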
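
The agreement between the two kinds of limits in this example can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names and the sample figures ($\bar{x} = 50$, $\sigma = 4$, $N = 16$) are hypothetical, and equal tail areas are assumed for both intervals.

```python
from statistics import NormalDist

def fiducial_limits(xbar, sigma, n, coeff=0.95):
    """Equal-tail limits read from the fiducial distribution N(xbar, sigma^2/n)."""
    fiducial = NormalDist(mu=xbar, sigma=sigma / n ** 0.5)
    tail = (1 - coeff) / 2
    return fiducial.inv_cdf(tail), fiducial.inv_cdf(1 - tail)

def confidence_limits(xbar, sigma, n, coeff=0.95):
    """Usual limits xbar -/+ z * sigma / sqrt(n) for known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - coeff) / 2)
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

if __name__ == "__main__":
    print(fiducial_limits(50.0, 4.0, 16))
    print(confidence_limits(50.0, 4.0, 16))
```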

The Behrens-Fisher problem is concerned with the difference of the means
of two independent normal populations with possibly different variances.
If $x_{11}, x_{12}, \ldots, x_{1N}$ is a random sample of $N$ observations from the first
population, in the order in which they are observed, and if $x_{21}, x_{22}, \ldots, x_{2N'}$ is a
random sample of $N'$ observations from the second population ($N' > N$), also in the
order of occurrence, then by sacrificing some of the observations from the
second sample, we can form confidence limits for the difference $\delta$ of the two
means.

If $u_i = x_{1i} - x_{2i}$, $i = 1, 2, \ldots, N$, we have

(12.45)    $E(u_i) = \delta$,    $\mathrm{Var}(u_i) = \sigma_1^2 + \sigma_2^2$

where $\sigma_1^2$ and $\sigma_2^2$ are the two variances. The consecutive $u_i$ are normal and
independent and we shall have the confidence interval

(12.46)    $\bar{u} - t_\alpha s < \delta < \bar{u} + t_\alpha s$

where $\bar{u}$ is the mean of the $u_i$, $s^2 = \sum (u_i - \bar{u})^2 / N(N - 1)$, which is an
unbiased estimate of $(\sigma_1^2 + \sigma_2^2)/N$, and $t_\alpha$ is the value of Student's $t$ (with $N - 1$
degrees of freedom) corresponding to the confidence coefficient $1 - \alpha$.

In an experiment consisting of repeated sampling from a pair of normal 
populations, the relative frequency of cases in which the statement (12.46) 
would be true, whatever the values of the respective means and variances, 
would be approximately $1 - \alpha$.
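
The repeated-sampling statement can be tried out by simulation. The sketch below is an illustration only: the population means, the unequal standard deviations, the sample sizes $N = 10$ and $N' = 15$, and the function name are all assumptions, and the constant 2.262 is the tabulated two-sided 5% point of Student's $t$ with 9 degrees of freedom, so it is valid only for $N = 10$.

```python
import random
from statistics import mean

T_9_TWO_SIDED_05 = 2.262  # Student's t, 9 d.f., two-sided 5% point (from tables)

def paired_coverage(mu1=5.0, mu2=2.0, sd1=1.0, sd2=3.0,
                    n=10, n_prime=15, trials=20_000, seed=7):
    """Fraction of samples in which interval (12.46) contains delta."""
    rng = random.Random(seed)
    delta = mu1 - mu2
    hits = 0
    for _ in range(trials):
        first = [rng.gauss(mu1, sd1) for _ in range(n)]
        second = [rng.gauss(mu2, sd2) for _ in range(n_prime)]
        # zip pairs observations in order and drops the surplus of the second sample.
        u = [a - b for a, b in zip(first, second)]
        u_bar = mean(u)
        s = (sum((ui - u_bar) ** 2 for ui in u) / (n * (n - 1))) ** 0.5
        hits += u_bar - T_9_TWO_SIDED_05 * s < delta < u_bar + T_9_TWO_SIDED_05 * s
    return hits / trials

if __name__ == "__main__":
    print(paired_coverage())
```

The observed frequency should lie close to 0.95 whatever means and variances are supplied.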

Fisher's solution of the problem has been given in § 9.8, leading to fiducial 
limits expressed by (9.27) and calculated with the help of Sukhatme's tables. 
However, as pointed out in that section, these fiducial limits are not confidence 
limits. 

12.12 Tests of Hypotheses. In many practical problems samples are ex- 
amined in order to test some hypothesis about the parent population, for 
example, that the mean is not greater than some specified value, or that the 
proportion of individuals with some definite
characteristic lies within assigned limits. 

Two kinds of errors may be made regard- 
ing this hypothesis: (1) The hypothesis 
may be rejected when it is really true (this 
is called an error of the first kind ) ; (2) the 
hypothesis may not be rejected when it is 
really false (this is called an error of the 
second kind). The usual testing procedure 
is designed so as to limit the risk of errors 
of the first kind to a specified value (such 
as 0.05) and at the same time to reduce as 
far as possible the risk of errors of the 
second kind. 

As a simple example, consider the estimation of the mean of a normal population
from samples of a fixed size. If the true mean is, say, 100 and if the standard error
of a sample mean is 1, then, for a confidence coefficient of 0.95, the region of
acceptance of the hypothesis that the true mean is 100 can be chosen in many ways,
including the three indicated in Figure 42. The intervals (a) 98.35 to $\infty$,
(b) 98 to 102, and (c) $-\infty$ to 101.65 are each such
that if a sample mean falls outside this interval, which will happen with a
probability of 0.05, the hypothesis that the true mean is 100 will be rejected
even though in fact this hypothesis is true. The risk of an error of the first
kind is, therefore, in each case 0.05. If, however, the true mean is not 100 but
some other value, say 101, the probability of accepting the hypothesis that it is
100 is greater in (a) than in (b) and greater in (b) than in (c), the probabilities
being indicated by the shaded areas under the dotted curves with mean 101. The
probability of committing an error of the second kind is, therefore, least with the
interval (c) and greatest with (a), and this will be true whenever the false mean is
greater than the true mean. For a false mean less than the true mean, say 99,
the conditions are reversed and the probability of an error of the second kind
is least with the interval (a) and greatest with (c).
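
The probabilities described in this example are easily computed. The following sketch uses the three intervals quoted in the text with a standard error of 1; the dictionary and function names are illustrative assumptions, and because the quoted limits are rounded the risks of the first kind come out only approximately equal to 0.05.

```python
import math
from statistics import NormalDist

INTERVALS = {
    "a": (98.35, math.inf),
    "b": (98.0, 102.0),
    "c": (-math.inf, 101.65),
}

def acceptance_probability(interval, true_mean, std_error=1.0):
    """Probability that the sample mean falls inside the interval."""
    low, high = interval
    dist = NormalDist(true_mean, std_error)
    upper = 1.0 if high == math.inf else dist.cdf(high)
    lower = 0.0 if low == -math.inf else dist.cdf(low)
    return upper - lower

if __name__ == "__main__":
    for mean_value in (100, 101, 99):
        row = {k: round(acceptance_probability(v, mean_value), 3)
               for k, v in INTERVALS.items()}
        print(mean_value, row)
```

With the mean at 100 each entry is one minus the risk of an error of the first kind; with the mean at 101 or 99 each entry is the risk of an error of the second kind.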

In the general problem of testing a hypothesis we have a set of random
variables $X_1, X_2, \ldots, X_n$, with a joint probability distribution $F(x_1, x_2, \ldots, x_n)$.
The statistical hypothesis $H_\omega$ is that $F(x_1, \ldots, x_n)$ belongs to a certain subclass
$\omega$ of distribution functions out of the whole class $\Omega$ of possible distribution
functions. If the class $\omega$ consists of a single element, the hypothesis is said
to be simple. Otherwise, it is called composite. The hypothesis that the
variables are normally and independently distributed with mean 0 and vari- 
ance 1 would be a simple hypothesis, because this specification determines F 
uniquely. 

The set of observations $x_1, x_2, \ldots, x_n$ determines, as noted previously, a
point $E$ in the sample space $W$. In order to test the hypothesis $H_\omega$ we have
to choose a certain region $A$ in $W$ such that, whenever $E$ falls outside $A$, $H_\omega$ is
to be rejected, and whenever $E$ falls within $A$, $H_\omega$ may be accepted. The
region $A$ is called the region of acceptance. Commonly we have, or assume,
a certain amount of information about the function $F(x_1, \ldots, x_n)$, as for
instance that the variables are independently and normally distributed with
the same distribution. In this case

$dF = \prod_i \phi(x_i)\, dx_i$, where $\phi(x_i) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - \mu)^2/2\sigma^2}$

and the hypothesis can refer only to the magnitudes of $\mu$ or $\sigma$ or both.

Problems of testing hypotheses and problems of estimation are both parts 
of statistical inference, although they do not cover the whole field. The 
following problem, for instance, which arises in the quality control of manu- 
factured products, is not precisely in either of the above classes. Assuming 
that in respect of some measurable quality the product should lie between 
certain specified limits, we have to decide on the basis of a sample whether 
a given batch of material should be classified as (a) between these limits, 
(b) above the upper limit, or (c) below the lower limit.

12.13 The Power of a Test. The size of the region of acceptance $A$ is
measured by the probability that the sample point $E$ falls within $A$, calculated
on the assumption that the hypothesis $H_\omega$ is true. In other words, the size
of this region is $1 - \alpha$ if the probability of committing an error of the first
kind is $\alpha$. Thus if the distribution function $F(x_1, \ldots, x_n)$ depends on a
certain unknown parameter $\theta$, and if the hypothesis $H_0$ is that $\theta = \theta_0$,

(12.47)    $\Pr\{E \in A \mid \theta_0\} = \int_{(A)} dF(x_1, x_2, \ldots, x_n, \theta_0) = 1 - \alpha$

If the alternative hypothesis $H_1$ is that $\theta = \theta_1$, the probability of committing
an error of the second kind, that is, of accepting $H_0$ when $H_1$ is true, is

$\Pr\{E \in A \mid \theta_1\} = \beta$

The quantity $1 - \beta$ is called the power of the region $A$ with respect to the
alternative hypothesis $\theta = \theta_1$.

If the power is plotted for different values of $\theta_1$ we get the power curve of the
region $A$.


Example 11. If $x_1, x_2, \ldots, x_N$ is a sample of $N$ independent observations from a normal
population with variance 1 and unknown mean $\mu$,

(12.48)    $dF(x_1, x_2, \ldots, x_N, \mu) = (2\pi)^{-N/2} \exp\left[-\tfrac{1}{2}\sum_i (x_i - \mu)^2\right] dx_1 \cdots dx_N$

Let the hypothesis $H_0$ be that $\mu = 0$. The region of acceptance for a confidence coefficient
of 0.95 may be selected as that region $A$ for which $|\bar{x}| < 1.96\,N^{-1/2}$, since

$\Pr\{E \in A \mid \mu = 0\} = (2\pi)^{-N/2} \int \cdots \int_{(A)} e^{-\frac{1}{2}\sum x_i^2}\, dx_1 \cdots dx_N$

and if we make an orthogonal transformation to new variables $y_1, \ldots, y_N$, of which
$y_1 = N^{1/2}\bar{x}$, we obtain

(12.49)    $\Pr\{E \in A \mid \mu = 0\} = (2\pi)^{-1/2} \int_{-1.96}^{1.96} e^{-\frac{1}{2}y_1^2}\, dy_1 = 0.95$

If the alternative hypothesis $H_1$ is that $\mu = \mu_1$, the probability of accepting $H_0$ when $H_1$ is
true is

$\Pr\{E \in A \mid \mu = \mu_1\} = (2\pi)^{-N/2} \int \cdots \int_{(A)} \exp\left[-\tfrac{1}{2}\sum_i (x_i - \mu_1)^2\right] dx_1 \cdots dx_N$

By a similar transformation to that used in (12.49) we can show that

(12.50)    $\Pr\{E \in A \mid \mu = \mu_1\} = (2\pi)^{-1/2} \int_{\delta_1}^{\delta_2} e^{-\frac{1}{2}y_1^2}\, dy_1 = \Phi(\delta_2) - \Phi(\delta_1)$

where $\delta_1 = -1.96 - \mu_1 N^{1/2}$ and $\delta_2 = 1.96 - \mu_1 N^{1/2}$.

Fig. 43

Hence the power of the region with respect to the hypothesis $H_1$ is $1 + \Phi(\delta_1) - \Phi(\delta_2)$.
This is plotted for different values of $\mu_1$ (and for $N = 100$) in Figure 43.
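
The power curve of this example can be tabulated directly from the expression $1 + \Phi(\delta_1) - \Phi(\delta_2)$. The sketch below is illustrative: the function name and the grid of values of $\mu_1$ are assumptions, and $N = 100$ is taken as in Figure 43.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function

def power_two_sided(mu1, n=100, z=1.96):
    """Power of the region |xbar| < z / sqrt(n) against the mean mu1."""
    root_n = n ** 0.5
    delta1 = -z - mu1 * root_n
    delta2 = z - mu1 * root_n
    return 1 + PHI(delta1) - PHI(delta2)

if __name__ == "__main__":
    for mu1 in (-0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4):
        print(f"{mu1:5.1f}  {power_two_sided(mu1):.3f}")
```

At $\mu_1 = 0$ the power equals the risk of an error of the first kind, about 0.05, and the curve is symmetrical about that point.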

12.14 Uniformly Most Powerful Tests. The region of acceptance corresponding
to a given value of $\alpha$ can, in general, be chosen in many ways. If $A$
and $A'$ are two regions of the same size, and if the power curve of $A'$ is below
that of $A$ for all values of $\theta_1$, then $A'$ is not as good as $A$ for testing the
hypothesis $\theta = \theta_0$. For it is clear that since $A$ and $A'$ are of the same size, the
probability ($\alpha$) of an error of the first kind is the same for both, but the
probability ($\beta$) of an error of the second kind is less for $A$ than it is for $A'$.
That is, in the long run we shall more frequently go wrong if we use the region
$A'$ to distinguish between the hypotheses $H_0$ and $H_1$ than if we use the region
$A$, no matter what the true value of $\theta$ may be. We then say that the test
using $A$ is uniformly more powerful than the test using $A'$. If this is so for
all possible alternative regions $A'$ of the same size as $A$, the test using $A$ is a
uniformly most powerful test. If such a test can be found we shall naturally
be quite satisfied with it, as no better one could be devised for distinguishing
between $H_0$ and $H_1$ at the specified level $\alpha$. This situation, however, rarely
arises. Usually the power curve for another region $A'$ will be above that for $A$
in some parts of the range of $\theta$ and below it in other parts.

Thus, in Example 11 above, we can choose as $A'$ the region for which
$\bar{x} < 1.65\,N^{-1/2}$, which also has a size 0.95. The power of $A'$ with respect to
$H_1$ is $1 - \Phi(1.65 - \mu_1 N^{1/2})$. The curve is plotted in Figure 43, and is above
that for $A$ when $\mu_1 > 0$ and below that for $A$ when $\mu_1 < 0$. Hence if we have
reason to believe that the unknown $\mu$ is positive, we shall prefer a test using $A'$
to a test using $A$, but if we have no reason to expect positive values rather
than negative values we shall probably prefer the test using $A$, which is more
symmetrical.
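
The crossing of the two power curves can be seen by evaluating both expressions side by side. This is a sketch under the assumptions of Example 11 (unit variance, $N = 100$); the function names are illustrative.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function

def power_region_a(mu1, n=100, z=1.96):
    """Power of the symmetrical region |xbar| < 1.96 / sqrt(n)."""
    shift = mu1 * n ** 0.5
    return 1 + PHI(-z - shift) - PHI(z - shift)

def power_region_a_prime(mu1, n=100, z=1.65):
    """Power of the one-sided region xbar < 1.65 / sqrt(n)."""
    return 1 - PHI(z - mu1 * n ** 0.5)

if __name__ == "__main__":
    for mu1 in (-0.2, -0.1, 0.1, 0.2):
        print(f"{mu1:5.1f}  A: {power_region_a(mu1):.3f}"
              f"  A': {power_region_a_prime(mu1):.3f}")
```

For positive $\mu_1$ the one-sided region has the greater power; for negative $\mu_1$ its power falls below even the level 0.05.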

The test using $A$, in fact, satisfies another criterion introduced by Neyman
and Pearson, namely, that of being unbiased. A test is unbiased if its power
curve for testing the hypothesis that $\theta = \theta_0$ has a minimum at the value $\theta_0$.
If so, the probability of rejecting the hypothesis is smaller if $\theta$ is really $\theta_0$ than
if $\theta$ is really some neighboring value $\theta_1$, and this is naturally desirable. Since
it is usually impossible to find a uniformly most powerful unbiased test (one
which is uniformly more powerful than any other unbiased test), Neyman and
Pearson have suggested using an unbiased test which is most powerful in the
neighborhood of $\theta_0$. That is, if $P(A \mid \theta)$ is the power of the region $A$ with
respect to the hypothesis that the unknown parameter is equal to $\theta$, then we
should choose $A$ so that

(12.51)    (a)  $\left.\dfrac{\partial P(A \mid \theta)}{\partial \theta}\right|_{\theta = \theta_0} = 0$

           (b)  $\left.\dfrac{\partial^2 P(A \mid \theta)}{\partial \theta^2}\right|_{\theta = \theta_0} \geq \left.\dfrac{\partial^2 P(A' \mid \theta)}{\partial \theta^2}\right|_{\theta = \theta_0}$

for all $A'$ satisfying (a) and of the same size as $A$. The second condition
requires the curvature of the power curve at the minimum to be greater for $A$
than for any unbiased alternative $A'$.

It may be proved that these two conditions can be satisfied in many prac- 
tical cases. In Example 11 above, the region A satisfies the more stringent 
conditions of being a uniformly most powerful unbiased test. 

The chief objection to the criterion (12.51) is that in practice we are more
interested in distinguishing between widely separated values $\theta_0$ and $\theta_1$ than
between values which are close together. For very large samples, however,
the difficulties of finding uniformly most powerful tests are greatly lessened, and
asymptotically most powerful tests may be shown to exist in most cases of
practical interest. A similar situation exists with regard to confidence intervals.
It is frequently possible to obtain asymptotically shortest unbiased confidence
intervals, even in cases where shortest unbiased confidence intervals in
the sense defined in § 12.8 do not exist.

12.15 Confidence Regions with More than One Parameter. Let us suppose
that the distribution function of the population depends on a set of parameters
$\theta_1, \theta_2, \ldots, \theta_h$, of which we desire to estimate a sub-set, $\theta_1, \ldots, \theta_m$ ($m < h$). The
remaining parameters are nuisance parameters. Any one particular set of
values of $\theta_1, \theta_2, \ldots, \theta_m$ determines a point $P$ in an $m$-dimensional parameter
space. If we can find a region $A_\theta$ of this space, determined entirely by the
sample, such that $\Pr\{P \in A_\theta\} = 1 - \alpha$, whatever the true values of any of
the parameters $\theta_1, \theta_2, \ldots, \theta_h$, then $A_\theta$ is a confidence region for $P$ with confidence
coefficient $1 - \alpha$.

It will be seen that the estimation of a single parameter is a special case of
this, where the region $A_\theta$ reduces to a line interval, bounded by $\underline{\theta}$ and $\bar{\theta}$.

Example 12. Suppose we have two independent samples of sizes $N_1$ and $N_2$ from normal
populations with different means $\mu_1$ and $\mu_2$ and a common variance $\sigma^2$. It is required to find
a confidence region for the two means.

The quantities $S_1/\sigma^2$, $S_2/\sigma^2$, $N_1 d_1^2/\sigma^2$ and $N_2 d_2^2/\sigma^2$, where $S_1$ and $S_2$ are the sums of squares
for the two samples, $d_1 = \bar{x}_1 - \mu_1$ and $d_2 = \bar{x}_2 - \mu_2$, are independently distributed as
$\chi^2$ with $N_1 - 1$, $N_2 - 1$, 1 and 1 degrees of freedom respectively. Hence $(S_1 + S_2)/\sigma^2$ and
$(N_1 d_1^2 + N_2 d_2^2)/\sigma^2$ are distributed as $\chi^2$ with $N_1 + N_2 - 2$ and 2 degrees of freedom, so
that

$F = \dfrac{N_1 + N_2 - 2}{2} \cdot \dfrac{N_1 d_1^2 + N_2 d_2^2}{S_1 + S_2}$

has the Snedecor $F$ distribution with 2 and $N_1 + N_2 - 2$ degrees of freedom. If $F_\alpha$ is
such that $\int_0^{F_\alpha} f(F)\, dF = 1 - \alpha$, $f(F)$ being the frequency function for $F$, the probability
that $F < F_\alpha$ is $1 - \alpha$.

This is the probability that

$N_1(\bar{x}_1 - \mu_1)^2 + N_2(\bar{x}_2 - \mu_2)^2 < 2F_\alpha(S_1 + S_2)/(N_1 + N_2 - 2)$

that is, the probability that the point $(\mu_1, \mu_2)$ in the $\mu_1\mu_2$ plane lies inside the ellipse with
equation

(12.52)    $N_1 d_1^2 + N_2 d_2^2 = 2F_\alpha(S_1 + S_2)/(N_1 + N_2 - 2)$

The ellipse is the confidence region $A_\theta$ in the two-dimensional parameter space of $\mu_1$ and $\mu_2$.
In this example the nuisance parameter $\sigma^2$ disappears.
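
The coverage of the elliptical region can be checked by simulation. The sketch below is an illustration only: the sample sizes, population values and function names are assumptions. It uses the fact that for 2 numerator degrees of freedom the upper $\alpha$ point of $F$ has the closed form $F_\alpha = \tfrac{1}{2}m(\alpha^{-2/m} - 1)$, where $m = N_1 + N_2 - 2$; this shortcut is not part of the text and does not hold for other numerator degrees of freedom.

```python
import random
from statistics import mean

def f_upper_point_2(m, alpha):
    """Upper alpha point of F with 2 and m d.f. (closed form for 2 numerator d.f.)."""
    return (m / 2) * (alpha ** (-2 / m) - 1)

def in_confidence_ellipse(sample1, sample2, mu1, mu2, alpha=0.05):
    """True if the point (mu1, mu2) lies inside the ellipse (12.52)."""
    n1, n2 = len(sample1), len(sample2)
    xbar1, xbar2 = mean(sample1), mean(sample2)
    s1 = sum((x - xbar1) ** 2 for x in sample1)  # sum of squares, sample 1
    s2 = sum((x - xbar2) ** 2 for x in sample2)  # sum of squares, sample 2
    m = n1 + n2 - 2
    lhs = n1 * (xbar1 - mu1) ** 2 + n2 * (xbar2 - mu2) ** 2
    return lhs < 2 * f_upper_point_2(m, alpha) * (s1 + s2) / m

def ellipse_coverage(mu1=10.0, mu2=12.0, sigma=3.0, n1=6, n2=9,
                     alpha=0.05, trials=20_000, seed=3):
    """Fraction of simulated sample pairs whose ellipse covers (mu1, mu2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        first = [rng.gauss(mu1, sigma) for _ in range(n1)]
        second = [rng.gauss(mu2, sigma) for _ in range(n2)]
        hits += in_confidence_ellipse(first, second, mu1, mu2, alpha)
    return hits / trials

if __name__ == "__main__":
    print(round(f_upper_point_2(13, 0.05), 2))
    print(ellipse_coverage())
```

The coverage does not depend on the value given to `sigma`, which illustrates the disappearance of the nuisance parameter.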


12.16 Composite Hypotheses. If the distribution under test depends on
$r + s$ parameters $\theta_1, \ldots, \theta_r, \theta_{r+1}, \ldots, \theta_{r+s}$, a null hypothesis $H_0$ which specifies
the form of the distribution and the $s$ parameters $\theta_{r+1}, \ldots, \theta_{r+s}$, while leaving
the others unspecified, is a composite hypothesis of $r$ degrees of freedom.
Additional specification of the parameters $\theta_1, \ldots, \theta_r$ will give a simple
hypothesis, $H_1$, and the problem is to find a region of acceptance of a fixed size,
whatever the values of $\theta_1, \ldots, \theta_r$, such that errors of the second kind are
minimized for all admissible simple alternatives $H_1$.


Example 13. A sample of size $N$ is taken from a normal population of mean $\mu$ and
variance $\sigma^2$. If $\sigma$ is unspecified, the hypothesis $H_0$ (that $\mu = \mu_0$) has 1 degree of freedom.

We wish to find a region $A$ such that $\int_{(A)} p_0\, dx_1 \cdots dx_N = 1 - \alpha$, whatever the
value of $\sigma$, $p_0$ being the probability of the sample calculated on the hypothesis $H_0$. Among
the admissible regions we then have to find the best (that is, the one which minimizes error
of the second kind) for the alternative simple hypothesis $H_1$ (that $\mu = \mu_1$ and $\sigma = \sigma_1$).

Now the likelihood $L$ is given by

(12.53)    $L = \log p_0 = -\dfrac{N}{2}\log(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}\sum_i (x_i - \mu_0)^2$

           $= C - N\log\sigma - \dfrac{N}{2\sigma^2}\left[(\bar{x} - \mu_0)^2 + s^2\right]$

where $\bar{x}$ and $s^2$ are the sample mean and variance respectively. Hence $p_0$ is constant over
the surface of the hypersphere

$\sum_i (x_i - \mu_0)^2 = \text{constant}$

and $A$ may be chosen by taking on each hypersphere an area which is a fraction $1 - \alpha$ of its
total area and combining these fractions. This area is clearly independent of $\sigma$.

Our problem is now to minimize $\int_{(A)} p_1\, dx_1 \cdots dx_N$, where $p_1$ is the probability of the
sample calculated on hypothesis $H_1$, subject to the condition $\int_{(A)} p_0\, dx_1 \cdots dx_N = 1 - \alpha$.
If we can do this we reduce the chance of error of the second kind to a minimum while
retaining the chance of error of the first kind at the fixed value $\alpha$.

The problem is equivalent to that of minimizing (without restriction) the integral
$\int_{(A)} (p_1 - \lambda p_0)\, dx_1 \cdots dx_N$, where $\lambda$ is a Lagrange undetermined multiplier. If we choose
for the region $A$ all points for which $p_1 - \lambda p_0 < 0$ and exclude all points for which
$p_1 - \lambda p_0 > 0$ we shall obviously make this integral as small as possible. The boundary of
the region $A$ is, therefore, given by

$p_1 = \lambda p_0$

or

$\log p_1 - \log p_0 = \lambda_1$

If

(12.54)    $L_1 = \log p_1 = C - N\log\sigma_1 - \dfrac{N}{2\sigma_1^2}\left[(\bar{x} - \mu_1)^2 + s^2\right]$

and if in (12.53) we put $\sigma = \sigma_1$ (since the choice of $A$ is independent of the value of $\sigma$), we
obtain

$\lambda_1 = \dfrac{N}{2\sigma_1^2}\left[(\bar{x} - \mu_0)^2 - (\bar{x} - \mu_1)^2\right]$

or

$\bar{x}(\mu_1 - \mu_0) = \sigma_1^2\lambda_1/N + \tfrac{1}{2}(\mu_1^2 - \mu_0^2) = c_1(\mu_1 - \mu_0)$, say.


The region $A$ on a given hypersphere should therefore be taken as the whole surface outside
the “cap” cut off by the hyperplane $\bar{x} = c_1$, $c_1$ being chosen so that the area of this cap is a
fraction $\alpha$ of the area of the hypersphere. The boundaries of these caps all lie on a
hypercircular cone with vertex at the point $(\mu_0, \mu_0, \ldots, \mu_0)$ and axis equally inclined to all the
coordinate axes.

If $\mu_1 > \mu_0$ the region corresponds to $\bar{x} < c_1$, and if $\mu_1 < \mu_0$ it corresponds to $\bar{x} > c_1$, and
these regions are independent of $\mu_1$. Hence either for $\mu_1 > \mu_0$ or for $\mu_1 < \mu_0$ (but not for
both) the test using this region of acceptance $A$ is a uniformly most powerful test.

From the geometry of the hypersphere it follows that the fractional area of the “cap,”
which is equal to $\alpha$, is given by

(12.55)    $\alpha = \dfrac{\Gamma\left(\frac{n+1}{2}\right)}{(n\pi)^{1/2}\,\Gamma\left(\frac{n}{2}\right)} \int_{t_\alpha}^{\infty} \left(1 + \dfrac{t^2}{n}\right)^{-(n+1)/2} dt$

where $t = n^{1/2}(\bar{x} - \mu_0)/s$ and $n = N - 1$. But this means that $t_\alpha$ is the value of Student's $t$
as given by the Fisher tables corresponding to $P = 2\alpha$. The one-tailed $t$-test, according to
which the hypothesis $H_0$ is rejected when $t > t_\alpha$, is, therefore, a uniformly most powerful test
for the class of admissible alternative hypotheses $H_1$ ($\mu = \mu_1 > \mu_0$), whatever $\sigma$ and $\sigma_1$ may
be. A similar statement holds when $\mu_1 < \mu_0$, the hypothesis being rejected when $-t > t_\alpha$.
If $\mu_1$ may be either greater or less than $\mu_0$ no uniformly most powerful test exists, but the
common procedure of the two-tailed $t$-test, that is, rejecting $H_0$ when $|t| > t_\alpha$, $t_\alpha$ being now
the value of $t$ corresponding to $P = \alpha$, provides a uniformly most powerful unbiased test.
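
The statistic of this example is written with the divisor $N$ in $s^2$ and the factor $n^{1/2}$ in front, which gives the same number as the more familiar form of Student's ratio. The sketch below illustrates this; the function names and the data are hypothetical, and the critical value $t_\alpha$ is assumed to be supplied from tables.

```python
from statistics import mean

def t_statistic(sample, mu0):
    """t = n^(1/2) (xbar - mu0) / s, with s^2 = sum((x - xbar)^2) / N and n = N - 1."""
    big_n = len(sample)
    xbar = mean(sample)
    s = (sum((x - xbar) ** 2 for x in sample) / big_n) ** 0.5
    return (big_n - 1) ** 0.5 * (xbar - mu0) / s

def one_tailed_reject(sample, mu0, t_alpha):
    """Reject H0 (mu = mu0) in favor of mu > mu0 when t exceeds t_alpha."""
    return t_statistic(sample, mu0) > t_alpha

if __name__ == "__main__":
    data = [102.0, 98.0, 105.0, 110.0, 99.0, 104.0]  # hypothetical observations
    print(t_statistic(data, 100.0))
    # 2.015 is the tabulated one-tailed 5% point for 5 degrees of freedom.
    print(one_tailed_reject(data, 100.0, 2.015))
```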


12.17 The Power of the t-Test. If we can calculate the power function of
the $t$-test for some value $\mu_1$ different from $\mu_0$, we can determine the probability
$(1 - \beta)$ of rejecting the hypothesis $H_0$ if the alternative hypothesis $H_1$ is true.
We naturally want this probability to be reasonably large. Unfortunately
the power function is not independent of $\sigma$, and indeed Dantzig has shown
that no test of the composite hypothesis $H_0$ ($\mu = \mu_0$, whatever the value of $\sigma$)
can have a power function independent of $\sigma$. In many practical cases we have
little or no prior knowledge of $\sigma$, but if we can estimate it roughly we can use
tables of the power function to assist in laying out an experiment designed to
detect a difference between $\mu_0$ and $\mu_1$ of a given order of magnitude.

Neyman and Tokarska have tabulated the power function for Student's
$t$-test (one-tailed), and more recently Lehmer has given inverse tables of
probabilities of errors of the second kind. Johnson and Welch have
provided fairly extensive tables of the non-central $t$-distribution, the probability
integral of which gives the power of the $t$-test. In Neyman and Tokarska's
tables the argument is $\rho = (n + 1)^{1/2}\Delta/\sigma$, where $\Delta = \mu_1 - \mu_0$. The alternative
hypothesis is that $\mu = \mu_1$, where $\mu_1 > \mu_0$. This type of hypothesis is suggested
when, for example, a new treatment is under investigation for increasing some
desirable property of a crop, such as the sugar yield of beet. If the mean
yield $\mu_1$ under the new treatment is definitely superior to that under the old,
$\mu_0$, it may be worth while to change over to the new treatment, but we shall
not be interested in changing if $\mu_1 < \mu_0$.

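The probability of accepting the hypothesis that $\mu = \mu_0$, when in fact
$\mu = \mu_1$, is equal to the probability that $t$ in (12.55) is less than $t_\alpha$, calculated
on the assumption that $\mu = \mu_1$. This is the probability that $n^{1/2}(\bar{x} - \mu_0) < t_\alpha s$
when the joint probability density for $\bar{x}$ and $s$ is given by (see eq. (7.33))

(12.56)    $f(\bar{x}, s) = \left(\dfrac{n+1}{2\pi}\right)^{1/2} \dfrac{2}{\sigma^{n+1}\,\Gamma(n/2)} \left(\dfrac{n+1}{2}\right)^{n/2} s^{n-1} \exp\left\{-\dfrac{(n+1)\left[(\bar{x} - \mu_1)^2 + s^2\right]}{2\sigma^2}\right\}$

This expression can be written down at once by remembering that for a normal
parent population $\bar{x}$ is normally distributed about $\mu_1$ with variance $\sigma^2/N$,
and $Ns^2/\sigma^2$ is independently distributed as $\chi^2$ with $N - 1$ ($= n$) degrees of
freedom.

The probability of error of the second kind is, therefore, given by integrating
$f(\bar{x}, s)$ for $\bar{x}$ from $-\infty$ to $\mu_0 + t_\alpha s\, n^{-1/2}$ and for $s$ from 0 to $\infty$. Hence

(12.57)    $\beta = \int_0^\infty \int_{-\infty}^{\mu_0 + t_\alpha s\, n^{-1/2}} f(\bar{x}, s)\, d\bar{x}\, ds$

On putting $z = (n + 1)^{1/2}(\bar{x} - \mu_1)/\sigma$, $v = \left(\dfrac{n+1}{n}\right)^{1/2} \dfrac{t_\alpha s}{\sigma}$, equation (12.57)
reduces to

(12.58)    $\beta = \dfrac{2}{\Gamma(n/2)} \left(\dfrac{n}{2t_\alpha^2}\right)^{n/2} \int_0^\infty v^{n-1} e^{-nv^2/2t_\alpha^2} \int_{-\infty}^{v - \rho} (2\pi)^{-1/2} e^{-z^2/2}\, dz\, dv$

where $\rho = (n + 1)^{1/2}\Delta/\sigma$. For a given $\alpha$, (12.58) expresses $\beta$ as a function
of $n$ and $\rho$. The integral can be evaluated numerically for given $n$ and $\rho$,
and we then have the power of the $t$-test for a given value of $\Delta$.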

The tables can be used to determine the value of $\mu_1$ necessary to reduce the
risk of errors of the second kind to a specified value. Suppose, for example,
that $\mu_0 = 100$ and $\sigma = 10$, with $N = 16$, and that we wish to determine $\mu_1$ so
that the risk is equal to the same value 0.05 for each kind of error. With
$\alpha = \beta = 0.05$, we find from the table that when $n = 15$, $\rho = 3.45$. Since
$\rho = 4\Delta/\sigma$, this gives $\Delta = 8.62$. If, therefore, $\mu_1$ is at least as great as 108.6,
there is at least a 95% chance of detecting the difference between $\mu_1$ and $\mu_0$ in
a sample of 16, at the 0.05 level of significance. This level indicates that the
probability of rejecting the hypothesis that $\mu = \mu_0$, when in fact it is so, is 0.05.
If the hypothesis is that $\mu \leq \mu_0$, the probability of unjustly rejecting it is
not greater than 0.05.
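
The arithmetic of this illustration, and the claimed 95% chance of detection, can be checked by a short simulation. The sketch below is illustrative only: the function names are assumptions, and 1.753 is the tabulated one-tailed 5% point of Student's $t$ with 15 degrees of freedom. The value $\rho = 3.45$ is the one quoted from the tables.

```python
import random
from statistics import mean

T_15_ONE_TAILED_05 = 1.753  # Student's t, 15 d.f., one-tailed 5% point (from tables)

def required_delta(rho, sigma, big_n):
    """Solve rho = N^(1/2) * delta / sigma for delta."""
    return rho * sigma / big_n ** 0.5

def simulated_power(mu0, mu1, sigma, big_n, t_alpha, trials=20_000, seed=11):
    """Fraction of samples from N(mu1, sigma^2) in which t exceeds t_alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(mu1, sigma) for _ in range(big_n)]
        xbar = mean(xs)
        s = (sum((x - xbar) ** 2 for x in xs) / big_n) ** 0.5  # divisor N, as in the text
        t = (big_n - 1) ** 0.5 * (xbar - mu0) / s
        rejections += t > t_alpha
    return rejections / trials

if __name__ == "__main__":
    delta = required_delta(3.45, 10.0, 16)
    print(delta)
    print(simulated_power(100.0, 100.0 + delta, 10.0, 16, T_15_ONE_TAILED_05))
```

Setting `mu1` equal to `mu0` in the same function gives the simulated risk of an error of the first kind.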

Again, suppose that two samples of 8 give means of $\bar{x}_1$ and $\bar{x}_2$. If the true
population means are $\mu_1$ and $\mu_2$, and if the hypothesis $H_0$ is that $\mu_2 - \mu_1 \leq 0$,
then to establish the alternative hypothesis $H_1$ that $\mu_2 > \mu_1$ it will be necessary
to reject $H_0$. If we want the risk of error in so doing to be not greater than
0.01, we must take $\alpha = 0.01$. The difference $\Delta = \mu_2 - \mu_1$ will be estimated by
the difference of the means $\bar{x}_2 - \bar{x}_1$, with a standard error of $\sigma(2/8)^{1/2} = \sigma/2$,
$\sigma$ being the standard deviation of the populations concerned (assumed the
same for both). The number of degrees of freedom for the estimation of $\sigma$ is
14, and this is the number that must be used for $n$ in entering the table.
For $n = 14$, $\alpha = 0.01$ and $\beta = 0.2$, we find that $\rho = 3.51$. Now $\rho$ is the ratio
of $\Delta$ to its standard error and so is here to be taken as $2\Delta/\sigma$. Hence
$\Delta/\sigma = 1.75$. That is, a difference in yields amounting to 1.75 times the standard
deviation has a reasonable chance (0.8) of being detected at the 0.01 level
of significance (when the probability of unjustly rejecting $H_0$ is not greater
than 0.01).

12.18 The Power Function of Analysis of Variance Tests. Suppose that $k$
“treatments” are to be compared, each being replicated $N$ times, and that
it is desired to test whether any significant difference exists between these
treatments, as expressed in the value of some variate $x$ which will be called
the “yield.”

The usual procedure is to compute Snedecor's $F$ and to reject the null
hypothesis if the observed $F$ is greater than the tabulated $F_\alpha$ corresponding to
an assigned level of significance $\alpha$. In tabulating the power function of this
test, P. C. Tang found it convenient to use $E^2 = f_1 F/(f_2 + f_1 F)$, where
$f_1$ = number of degrees of freedom for treatment mean square $= k - 1$, and
$f_2$ = number of degrees of freedom for error mean square $= k(N - 1)$.

If $q_1$ and $q_2$ are the treatment sum of squares and the error sum of squares
respectively, it is easily seen from the definition of $F$ that

(12.59)    $E^2 = q_1/(q_1 + q_2)$

It may be noted that the squared correlation ratio given in (11.1) is a special
case, the treatment means being replaced by means of arrays.

If the true effect of the $i$th treatment is $\alpha_i$ and if the origin is so chosen that
$\sum \alpha_i = 0$, the standard deviation of the treatment effects is

$\sigma_a = \left(\sum_i \alpha_i^2 / k\right)^{1/2}$



If $\sigma$ is an estimate of the standard deviation of individual yields the standard
error of a treatment mean is $\sigma/N^{1/2} = \sigma_m$. The ratio

(12.60)    $\phi = \sigma_a/\sigma_m$

is used in both Tang's and Lehmer's tables as an argument. Tang has shown
that when $\phi$ is not zero the frequency function for $q_1/\sigma^2$ (denoted by $\chi'^2$) is

(12.61)    $f(\chi'^2) = \ldots$

where $\lambda = k\phi^2/2$.
This is known as the non-central $\chi^2$ distribution. When $\phi = 0$, it reduces
to the ordinary $\chi^2$ distribution with $k$ degrees of freedom. The quantity
$q_1/q_2$ is the ratio of a non-central $\chi^2$ variate to an ordinary $\chi^2$ variate, and from
its distribution that of $E^2$ is obtained. The result is

(12.62)    $f(E^2 \mid \lambda) = \ldots$

where $H(\lambda E^2)$ is the confluent hypergeometric series defined in (11.8). For
$\lambda = 0$, $E^2$ is a Beta-variate. The null hypothesis $H_0$ is that $\phi = 0$, that is,
$\lambda = 0$. The alternative hypothesis $H_1$ is that $\lambda$ is not 0. For a given
significance level $\alpha$ (the probability of error of the first kind), the critical value
$E_\alpha^2$ is given by

(12.63)    $\int_{E_\alpha^2}^{1} f(E^2 \mid \lambda = 0)\, dE^2 = \alpha$

The probability of error of the second kind is then

(12.64)    $\beta = \int_0^{E_\alpha^2} f(E^2 \mid \lambda)\, dE^2$

and can be calculated numerically for a given $E_\alpha^2$.

The special case when $k = 2$ was considered in the previous section.
Then $\alpha_1 = \tfrac{1}{2}(\mu_0 - \mu_1)$ and $\alpha_2 = \tfrac{1}{2}(\mu_1 - \mu_0)$, so that $\sigma_a = \Delta/2$. Hence
$\phi = N^{1/2}\Delta/2\sigma = \rho/\sqrt{2}$, where $\rho$ is the argument used by Neyman and
Tokarska. However, Tang's tables are for the symmetric two-tailed test,
whereas Neyman's are for the asymmetric one-tailed test, so that Tang's level
$\alpha = 0.01$ (with $f_1 = 1$) corresponds to Neyman's level $\alpha = 0.005$.

Example 14. If the estimated value of $\sigma$ is 10, and if the true effects for 4 treatments, in
5 replications each, are $-5$, $-4$, 3, 6, we have

$\sigma_a = [(25 + 16 + 9 + 36)/4]^{1/2} = 4.64$

$\sigma_m = 10/5^{1/2} = 4.47$

$\phi = 1.04$, $f_1 = 3$, $f_2 = 16$

Taking $\alpha = 0.05$, we find from the tables that $\beta$ is about 0.7. That is, in about 3
experiments in 10 we should find the suggested combination of treatments significant at the 5%
level. Except for $k = 2$, there are, of course, infinitely many sets of $\alpha$'s which would give
the same value of $\phi$. The tables do, however, permit one to estimate what order of
magnitude of treatment effects would be expected to show up in an experiment of a given design.

Lehmer's tables, referred to above, give directly the value of $\phi$ required for a given power,
or, what amounts to the same thing, for a given probability of error of the second kind. The
table for $k = 2$ corresponds to the two-tailed $t$-test.
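
The figures of Example 14 can be reproduced, and the tabulated power roughly confirmed, with a short simulation. The sketch below is illustrative: the function names are assumptions, $\sigma_a$ is taken as the root mean square of the treatment effects, and 3.24 is the tabulated upper 5% point of $F$ with 3 and 16 degrees of freedom. A simulation gives only an approximate value of the power.

```python
import random
from statistics import mean

F_05_3_16 = 3.24  # upper 5% point of F with 3 and 16 d.f. (from tables)

def phi(effects, sigma, reps):
    """phi = sigma_a / sigma_m for treatment effects that sum to zero."""
    k = len(effects)
    sigma_a = (sum(a * a for a in effects) / k) ** 0.5
    sigma_m = sigma / reps ** 0.5
    return sigma_a / sigma_m

def simulated_anova_power(effects, sigma, reps, f_alpha, trials=20_000, seed=5):
    """Fraction of simulated experiments in which F exceeds f_alpha."""
    rng = random.Random(seed)
    k = len(effects)
    f1, f2 = k - 1, k * (reps - 1)
    significant = 0
    for _ in range(trials):
        groups = [[rng.gauss(a, sigma) for _ in range(reps)] for a in effects]
        means = [mean(g) for g in groups]
        grand = mean(means)
        q1 = reps * sum((m - grand) ** 2 for m in means)                  # treatment S.S.
        q2 = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)  # error S.S.
        significant += (q1 / f1) / (q2 / f2) > f_alpha
    return significant / trials

if __name__ == "__main__":
    effects = [-5.0, -4.0, 3.0, 6.0]
    print(round(phi(effects, 10.0, 5), 2))
    print(simulated_anova_power(effects, 10.0, 5, F_05_3_16))
```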

12.19 Sequential Tests of Hypotheses. Suppose that under a hypothesis
$H_0$ the frequency function for a variable $x$ is $f_0(x)$ and that under an alternative
hypothesis $H_1$ it is $f_1(x)$. If $x_1, x_2, \ldots, x_n$ are the observed values of $x$ in a
sample of $n$, the probability of this sample under $H_0$ is

(12.65)    $p_0 = f_0(x_1) f_0(x_2) \cdots f_0(x_n)$

Under $H_1$ the probability is

(12.66)    $p_1 = f_1(x_1) f_1(x_2) \cdots f_1(x_n)$

The ratio of these probabilities $p_1/p_0$ may be taken as a measure of our belief
in $H_1$ rather than in $H_0$. We may agree to accept $H_1$ if $p_1/p_0 \geq A$ and to
accept $H_0$ if $p_1/p_0 \leq B$, where $A$ and $B$ are reasonable limits, arbitrarily
assigned. Writing $L_1 = \log p_1$, etc., $a = \log A$, $b = \log B$, and $Z_n = L_1 - L_0$,
we can express these alternatives as follows:

    Accept $H_1$ if $Z_n \geq a$;

    Accept $H_0$ if $Z_n \leq b$.

Clearly, as far as the second alternative is concerned, this is the same principle
as that used by Neyman and Pearson in defining a region of acceptance for $H_0$
(see Example 13 in § 12.16).

However, if $b < Z_n < a$, neither of the alternatives can be accepted and
the choice must remain in doubt. In the sequential probability-ratio test, the
criterion of the size of $Z_n$ is applied as the sample is accumulated, an item at a
time, sampling being continued and $Z_n$ recalculated until finally either $H_0$
or $H_1$ is accepted. Instead of being fixed, the sample size $n$ is now a stochastic
variable. The chief practical merit of sequential testing is that in ordinary
situations a considerably smaller sample size is required on the average to
achieve the same degree of confidence in accepting $H_0$ (or $H_1$) than is required
with the customary test procedure. In routine testing it is not always
convenient to take the sample in this piecemeal fashion, one item at a time,
between tests, but where the method is applicable it is decidedly economical.

It may happen, of course, that the test will go on for a long time before a
decision is reached. Sooner or later, however, either $H_0$ or $H_1$ must be
accepted. The probability that $b < Z_n < a$ for all $n$ is zero. This was shown
by A. Wald in an important paper which laid down the basic ideas of
sequential testing.

Now the probability $\alpha$ of error of the first kind is the probability of
rejecting $H_0$ when $H_0$ is true. If $\beta$ is the probability of error of the second kind,
$1 - \beta$ is the probability of rejecting $H_0$ when $H_1$ is true. Hence $(1 - \beta)/\alpha$
is the ratio of the probabilities of $H_1$ and $H_0$ for a sample which leads to the
rejection of $H_0$. That is, $(1 - \beta)/\alpha = p_1/p_0 \geq A$, since it is only when
$p_1/p_0 \geq A$ that $H_0$ is rejected. Similarly $\beta/(1 - \alpha) \leq B$.

For practical purposes (at least when the sample size is greater than 20)
these inequalities can be replaced by equations, since as a rule one more
observation will make little difference to $Z_n$. When the bound $a$ or $b$ is finally
overstepped it will not be by very much. Wald has given limits to the error
involved in this assumption.
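
The procedure can be sketched in a few lines for the case of a normal variate with known standard deviation. This is an illustration under stated assumptions, not a definitive implementation: the function name `sprt_normal`, the argument `draw` (a function returning one observation at a time) and the safety limit `max_n` are hypothetical, and the bounds are taken as $a = \log[(1 - \beta)/\alpha]$ and $b = \log[\beta/(1 - \alpha)]$, that is, with the inequalities above replaced by equations.

```python
import math
import random

def sprt_normal(draw, mu0, mu1, sigma, alpha=0.05, beta=0.05, max_n=100_000):
    """Sequential probability-ratio test of mu = mu0 against mu = mu1."""
    a = math.log((1 - beta) / alpha)   # upper bound, log A
    b = math.log(beta / (1 - alpha))   # lower bound, log B
    z_n = 0.0
    for n in range(1, max_n + 1):
        x = draw()
        # z_i = log f1(x) - log f0(x) for normal densities with common sigma
        z_n += (2 * (mu1 - mu0) * x - (mu1 ** 2 - mu0 ** 2)) / (2 * sigma ** 2)
        if z_n >= a:
            return "H1", n
        if z_n <= b:
            return "H0", n
    return "undecided", max_n

if __name__ == "__main__":
    rng = random.Random(0)
    runs = [sprt_normal(lambda: rng.gauss(0.0, 1.0), 0.0, 0.5, 1.0)
            for _ in range(2000)]
    wrong = sum(1 for decision, _ in runs if decision == "H1") / len(runs)
    average_n = sum(n for _, n in runs) / len(runs)
    print(wrong, average_n)
```

In the usage shown, the observations are drawn with $H_0$ true, so the first printed figure estimates the risk of an error of the first kind and the second the average number of observations needed.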

We, therefore, in practice choose $A = (1 - \beta)/\alpha$ and $B = \beta/(1 - \alpha)$,
where $\alpha$ and $\beta$ are the errors of the two kinds that we are prepared to
tolerate.

12.20 Expected Number of Observations in a Sequential Probability-Ratio
Test. Let $n$ be the smallest integer for which either $Z_n \geq a$ or $Z_n \leq b$. We
require to find $E_0(n)$ and $E_1(n)$, the expected values of $n$ on the two hypotheses
$H_0$ and $H_1$ respectively.

If $H_0$ is true, the probability of accepting $H_0$ is $1 - \alpha$ and the probability
of accepting $H_1$ is $\alpha$. The second case corresponds approximately to
$Z_n = a$ ($= \log A$) and the first to $Z_n = b$ ($= \log B$). The expected value
of $Z_n$ is, therefore,

(12.67)    $E_0(Z_n) = \alpha a + (1 - \alpha) b$

Now $Z_n = \log(p_1/p_0) = \sum_{i=1}^{n} z_i$, where $z_i = \log f_1(x_i) - \log f_0(x_i)$. Since the
$z_i$ are independent random variables with the same distribution and the number
$n$ is also a random variable,

$E_0(Z_n) = E_0(n) E_0(z)$

so that

(12.68)    $E_0(n) = \dfrac{\alpha a + (1 - \alpha) b}{E_0(z)}$

Similarly, if $H_1$ is true, the probabilities of accepting $H_0$ and $H_1$ are $\beta$ and
$1 - \beta$ respectively. Hence

(12.69)    $E_1(Z_n) = (1 - \beta) a + \beta b$

and

(12.70)    $E_1(n) = \dfrac{(1 - \beta) a + \beta b}{E_1(z)}$

If the distribution of the variate $x$ is known, $E_0(z)$ and $E_1(z)$ can be
calculated. Thus if $x$ is normally distributed with variance $\sigma^2$ about a mean
which is either $\mu_0$ or $\mu_1$, we can take $H_0$ as the hypothesis $\mu = \mu_0$ and $H_1$ as
the hypothesis $\mu = \mu_1$. Then

$f_0(x_i) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - \mu_0)^2/2\sigma^2}$,    $f_1(x_i) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - \mu_1)^2/2\sigma^2}$

and

$z_i = \dfrac{2(\mu_1 - \mu_0)x_i - (\mu_1^2 - \mu_0^2)}{2\sigma^2}$

The expectation of $z_i$ on hypothesis $H_0$ is

$E_0(z_i) = \int \cdots \int z_i\, p_0\, dx_1 \cdots dx_n = -(\mu_1 - \mu_0)^2/2\sigma^2$

and similarly $E_1(z_i) = (\mu_1 - \mu_0)^2/2\sigma^2$. Hence, from (12.68) and (12.70),

(12.71)    $E_0(n) = \dfrac{c_0 \sigma^2}{(\mu_1 - \mu_0)^2}$,    $E_1(n) = \dfrac{c_1 \sigma^2}{(\mu_1 - \mu_0)^2}$

where $c_0$ and $c_1$ depend only on $\alpha$ and $\beta$. It therefore appears, as is intuitively
obvious, that the smaller the difference between $\mu_0$ and $\mu_1$, the more observations
will be required to discriminate between them at a fixed level of
significance.

It is of interest to compare the values in (12.71) with the fixed size of sample
$N$ necessary to discriminate between $\mu_0$ and $\mu_1$ by the ordinary test procedure
with the same size of errors. This procedure consists in accepting $H_0$ if
$\bar{x} < \lambda$ and accepting $H_1$ if $\bar{x} > \lambda$, $\lambda$ being a suitably chosen constant. On the
hypothesis $H_0$, $\bar{x}$ is normally distributed about $\mu_0$ with variance $\sigma^2/N$ and the
probability of accepting $H_1$ is therefore

$\alpha = (2\pi)^{-1/2} \int_{t_\alpha}^{\infty} e^{-t^2/2} \, dt = 1 - \Phi(t_\alpha)$

where $t_\alpha = (\lambda - \mu_0) \sqrt{N} / \sigma$.

Similarly the probability of an error of the second kind is $\beta = 1 - \Phi(t_\beta)$, where
$t_\beta = (\mu_1 - \lambda) \sqrt{N} / \sigma$. Hence

(12.72) $N = \dfrac{(t_\alpha + t_\beta)^2 \sigma^2}{(\mu_1 - \mu_0)^2}$

From (12.71) and (12.72) we see that the ratios $E_0(n)/N$ and $E_1(n)/N$ are
independent of $\mu_1$, $\mu_0$ and $\sigma$ and so may be calculated for any assigned $\alpha$ and
$\beta$. If, for example, $\alpha = \beta = 0.05$, we find that $t_\alpha = t_\beta = 1.645$, and $E_0(n)/N =
E_1(n)/N = 0.49$. There is, in this case, an expected saving of about 50% in
the number of observations required.
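
The ratio quoted above can be reproduced with a short Python sketch. It is an illustration only, with a hypothetical function name. It assumes the normal case, so that $E_0(z) = -E_1(z) = -(\mu_1 - \mu_0)^2 / 2\sigma^2$, and it takes the fixed sample size as $N = (t_\alpha + t_\beta)^2 \sigma^2 / (\mu_1 - \mu_0)^2$; the common factor $\sigma^2 / (\mu_1 - \mu_0)^2$ cancels in the ratios.

```python
import math
from statistics import NormalDist


def sprt_vs_fixed(alpha, beta):
    """Return (E0(n)/N, E1(n)/N) for a normal mean with known variance."""
    a = math.log((1 - beta) / alpha)
    b = math.log(beta / (1 - alpha))
    # Expected sample numbers from (12.68) and (12.70), in units of
    # sigma^2 / (mu1 - mu0)^2, using E0(z) = -1/2 and E1(z) = +1/2 in
    # those units.
    c0 = -2 * (alpha * a + (1 - alpha) * b)
    c1 = 2 * ((1 - beta) * a + beta * b)
    # Fixed sample size in the same units.
    t_alpha = NormalDist().inv_cdf(1 - alpha)
    t_beta = NormalDist().inv_cdf(1 - beta)
    n_fixed = (t_alpha + t_beta) ** 2
    return c0 / n_fixed, c1 / n_fixed


ratio0, ratio1 = sprt_vs_fixed(0.05, 0.05)
print(round(ratio0, 2), round(ratio1, 2))
```

Both ratios come out near 0.49 for $\alpha = \beta = 0.05$, in agreement with the figure given in the text.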

Although, as stated above, the sequential process must ultimately terminate
in a decision one way or the other, it may not be practicable to continue testing
beyond a sample size of, say, $n'$. If the issue is still undecided at this stage,
we can formulate a reasonable rule as follows:

Accept $H_0$ if $Z_{n'} < 0$;

Accept $H_1$ if $Z_{n'} > 0$.

The probabilities of the two kinds of error, say $\alpha(n')$ and $\beta(n')$, are somewhat
different from $\alpha$ and $\beta$ in this truncated test. It may be shown that

$\alpha(n') < \alpha + \Phi(\nu_2) - \Phi(\nu_1)$

where

$\nu_2 = \dfrac{a - n' E_0(z)}{\sqrt{n'} \, \sigma_0(z)}, \quad \nu_1 = \dfrac{-n' E_0(z)}{\sqrt{n'} \, \sigma_0(z)}$

with a similar expression for $\beta(n')$. On the assumption of a normal population, $E_0(z) = -(\mu_1 - \mu_0)^2 / 2\sigma^2$ and $\sigma_0(z) = (\mu_1 - \mu_0)/\sigma$. If $\alpha$ and $\beta$ are
chosen as 0.05 each, if $N$ in (12.72) is taken as 100, and if $n' = 200$, we find
$\alpha(n') = \beta(n') < 0.058$. Hence in a sequential test designed to discriminate
with the same degree of accuracy as a fixed-sample test of size 100, we do not
seriously increase the risk of error by stopping at 200 items, even though the
test is still indecisive. The chance of our having to continue so far is, of
course, quite small.

12.21 Test of a Hypothesis Against a One-sided Alternative. Let $H_0$ be
the simple hypothesis that a parameter $\theta$ is equal to $\theta_0$ and let the alternative
one-sided hypothesis $H_1$ be that $\theta > \theta_0$.

The probability of error of the first kind may be arbitrarily chosen as $\alpha$, but
the probability of error of the second kind depends on the true value of $\theta$.
We can, however, take a value of $\theta_1 > \theta_0$ and construct a sequential test for
$H_0$ against the single alternative hypothesis that $\theta = \theta_1$, with an assigned
probability $\beta$ of error of the second kind. If this test is such that the error
of the second kind is less than (or equal to) $\beta$ when $\theta > \theta_1$ (so that the power
curve rises as $\theta$ increases beyond $\theta_1$), the test may be used for $H_0$ against any
alternative hypothesis $\theta > \theta_0$. Although the chance of error of the second kind
may be high when $\theta$ is near $\theta_0$, this error will not matter very much. It will
merely mean accepting for $\theta$ a close approximation $\theta_0$.

In most important practical cases the probability of error of the second kind
does decrease as $\theta$ increases. Since, for such cases, it is also true that the
probability of error of the first kind is equal to or less than $\alpha$ whenever $\theta < \theta_0$,
the same test can be used for the composite hypothesis $\theta \le \theta_0$ against the
alternative hypothesis $\theta > \theta_0$.

12.22 The Sequential Test for a Binomial Distribution. One important
example is provided by the simple classification of manufactured test objects
into two classes, say "defective" and "satisfactory." Let us suppose that a
lot will be accepted if the proportion of defectives $p < p'$ but will be rejected
if $p > p'$. We fix arbitrarily two values, $\pi_0 < p'$ and $\pi_1 > p'$, such that the
consequences of rejecting the lot will be serious if the true value of $p$ is less
than $\pi_0$ and such that the consequences of accepting the lot will also be serious
if $p > \pi_1$. (The consequences of unjustly rejecting a good lot are, of course,
trouble and expense to the manufacturer. The consequences of unjustly
accepting a bad lot will be felt by the customer who buys the lot, but will
ultimately injure the manufacturer's reputation.) We decide also on the
risks $\alpha$ and $\beta$ we are prepared to run of committing these serious errors, and
then are able to construct a sequential test.

Since the distribution of the number of defectives $d_n$ in the first $n$ units
inspected is binomial, the probabilities of the observed sample on the hypotheses $p = \pi_1$ and $p = \pi_0$ are

$p_1 = \pi_1^{d_n} (1 - \pi_1)^{n - d_n}$

and

$p_0 = \pi_0^{d_n} (1 - \pi_0)^{n - d_n}$

whence

(12.73) $Z_n = \log \dfrac{p_1}{p_0} = d_n \log \dfrac{\pi_1}{\pi_0} + (n - d_n) \log \dfrac{1 - \pi_1}{1 - \pi_0}$

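The lot is rejected if $Z_n > a$ and accepted if $Z_n < b$, where $a = \log [(1 - \beta)/\alpha]$
and $b = \log [\beta/(1 - \alpha)]$. If neither is true, the test is continued. This test
is equivalent to setting up acceptance numbers $A_n$ and rejection numbers $R_n$
for each value of $n$, and continuing testing as long as $A_n < d_n < R_n$. The
numbers $A_n$ and $R_n$ are given by substituting for $Z_n$ in (12.73). They are

(12.74) $A_n = \dfrac{b - n \log \dfrac{1 - \pi_1}{1 - \pi_0}}{\log \dfrac{\pi_1}{\pi_0} - \log \dfrac{1 - \pi_1}{1 - \pi_0}}, \quad R_n = \dfrac{a - n \log \dfrac{1 - \pi_1}{1 - \pi_0}}{\log \dfrac{\pi_1}{\pi_0} - \log \dfrac{1 - \pi_1}{1 - \pi_0}}$

Since $A_n$ and $R_n$ depend linearly on $n$, they define a sloping band of constant
width in the $n$, $d_n$ plane. See Figure 44.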

The figure is drawn for $\alpha = \beta = 0.05$, $\pi_1 = 0.03$, $\pi_0 = 0.001$. The lines
defining the band are

$A_n = -0.858 + 0.00859n$
$R_n = 0.858 + 0.00859n$

The lot cannot, therefore, be accepted until 100 random samples have shown
no defectives. If 1 defective appears in the first 17, or if 2 appear in the
first 132, the lot is rejected, and so on.
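
The two boundary lines can be recomputed directly from (12.73). The Python sketch below is ours (the function name is hypothetical) and assumes the values $\alpha = \beta = 0.05$, $\pi_0 = 0.001$, $\pi_1 = 0.03$; it solves $Z_n = b$ and $Z_n = a$ for $d_n$ to obtain the intercepts and the common slope.

```python
import math


def binomial_sprt_lines(pi0, pi1, alpha, beta):
    """Return (accept_intercept, reject_intercept, slope).

    The acceptance line is A_n = accept_intercept + slope * n and the
    rejection line is R_n = reject_intercept + slope * n.
    """
    a = math.log((1 - beta) / alpha)
    b = math.log(beta / (1 - alpha))
    g1 = math.log(pi1 / pi0)
    g0 = math.log((1 - pi1) / (1 - pi0))  # negative when pi1 > pi0
    denom = g1 - g0
    return b / denom, a / denom, -g0 / denom


accept_intercept, reject_intercept, slope = binomial_sprt_lines(
    0.001, 0.03, 0.05, 0.05
)
# Smallest n at which a sample with no defectives reaches the acceptance line.
first_accept_n = math.ceil(-accept_intercept / slope)
print(accept_intercept, reject_intercept, slope, first_accept_n)
```

The intercepts are close to $\mp 0.858$ and the slope to 0.00859, and a lot with no defectives first reaches the acceptance line at $n = 100$.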

Let $L_p$ be the probability of accepting the lot for any given $p$. The curve
of $L_p$ against $p$ is called the operating characteristic curve (OC for short) of the
test. Since the acceptance of the lot when $p$ is large is an error of the second
kind, the OC curve is in effect a power curve for the test, $1 - L_p$ corresponding
to the power. $L_p$ decreases as $p$ increases, from 1 when $p = 0$ to 0 when $p = 1$.
If $p = \pi_0$, $L_p = 1 - \alpha$, and if $p = \pi_1$, $L_p = \beta$. Wald has shown that if $\pi_0$
and $\pi_1$ are not too far apart (so that the expected value and the variance of $z$
are small) then $L_p$ is approximately given by



(12.75) $L_p = \dfrac{A^h - 1}{A^h - B^h}$

where $h$ is the non-zero root of the equation

(12.76) $p \left( \dfrac{\pi_1}{\pi_0} \right)^h + (1 - p) \left( \dfrac{1 - \pi_1}{1 - \pi_0} \right)^h = 1$

By choosing various values of $h$, substituting in (12.75) to get $L_p$ and solving
(12.76) for $p$, the OC curve can be constructed. The probability of accepting
a lot with a given proportion of defectives can then be read off.

The expected number of observations required to reach a decision can be
calculated in terms of $L_p$ and $p$. It is given approximately by

(12.77) $E_p(n) = \dfrac{b L_p + a (1 - L_p)}{p \log \dfrac{\pi_1}{\pi_0} + (1 - p) \log \dfrac{1 - \pi_1}{1 - \pi_0}}$

With the data assumed in Figure 44, this number is 53 for $p = 0.02$ and 36
for $p = 0.03$.
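
As a check on these two figures, the following Python sketch evaluates Wald's approximations numerically. It is an illustration under stated assumptions, not the book's procedure: instead of choosing $h$ and solving for $p$, it fixes $p$ and finds the non-zero root $h$ of $p(\pi_1/\pi_0)^h + (1-p)[(1-\pi_1)/(1-\pi_0)]^h = 1$ by bisection, then applies $L_p = (A^h - 1)/(A^h - B^h)$ and (12.77). The function name is hypothetical, and the code assumes $0 < p < 1$ with $E_p(z) \ne 0$.

```python
import math


def oc_and_asn(p, pi0, pi1, alpha, beta):
    """Return (L_p, E_p(n)) for the binomial sequential test.

    Assumes 0 < p < 1 and that the mean of z under p is not zero.
    """
    A = (1 - beta) / alpha
    B = beta / (1 - alpha)
    a, b = math.log(A), math.log(B)
    r1 = pi1 / pi0
    r0 = (1 - pi1) / (1 - pi0)
    mean_z = p * math.log(r1) + (1 - p) * math.log(r0)

    def g(h):
        return p * r1 ** h + (1 - p) * r0 ** h - 1

    # g is convex with g(0) = 0, so the non-zero root lies on the side of
    # zero where g first dips below zero: h < 0 if mean_z > 0, else h > 0.
    sign = -1.0 if mean_z > 0 else 1.0
    near, far = sign * 1e-6, sign
    while g(far) < 0:
        far *= 2
    for _ in range(200):
        mid = (near + far) / 2
        if g(mid) < 0:
            near = mid
        else:
            far = mid
    h = (near + far) / 2

    L = (A ** h - 1) / (A ** h - B ** h)
    expected_n = (b * L + a * (1 - L)) / mean_z
    return L, expected_n


L_02, n_02 = oc_and_asn(0.02, 0.001, 0.03, 0.05, 0.05)
L_03, n_03 = oc_and_asn(0.03, 0.001, 0.03, 0.05, 0.05)
print(round(n_02), round(n_03))
```

At $p = \pi_1 = 0.03$ the root is $h = -1$ and $L_p$ reduces to $\beta$, which is a convenient sanity check on the root finder.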

12.23 Tolerance Limits for a Parent Population. Instead of fixing confidence limits for a parameter of a parent population, it is sometimes convenient
to estimate the limits between which a specified proportion of the population
may be expected, with a specified degree of confidence, to lie. Such limits are
called tolerance limits.

We may, for instance, wish to know what size of sample to take in order to
be confident (with confidence coefficient $1 - \alpha$) that a fraction at least $\beta$ of
the population will have a value of $x$ lying within the sample range.

Assuming that the random variable $x$ is continuous with frequency function
$f(x)$, we have for the joint probability of the smallest sample value $x_1$ and the
largest value $x_N$, as in § 7.19,

$dP = N(N - 1) \left[ \int_{x_1}^{x_N} f(x) \, dx \right]^{N-2} f(x_1) f(x_N) \, dx_1 \, dx_N$

Let

$v = \int_{x_1}^{x_N} f(x) \, dx, \quad u = \int_{-\infty}^{x_1} f(x) \, dx$

Then $v$ is the proportion of the population lying between $x_1$ and $x_N$, that is,
within the sample range.

By differentiating under the integral sign, we see that

$\dfrac{\partial(u, v)}{\partial(x_1, x_N)} = \begin{vmatrix} f(x_1) & 0 \\ -f(x_1) & f(x_N) \end{vmatrix} = f(x_1) f(x_N)$

Hence $du \, dv = f(x_1) f(x_N) \, dx_1 \, dx_N$ so that the joint probability for $u$ and $v$ is
$dP = N(N - 1) v^{N-2} \, du \, dv$. The probability for $v$ is given by integrating
over $u$. Now $0 < u < 1 - v$, so that the range of integration for $u$ is from
0 to $1 - v$. The frequency function for $v$ is, therefore,

(12.78) $f(v) \, dv = N(N - 1) v^{N-2} (1 - v) \, dv, \quad 0 < v < 1$

The probability that $v > \beta$ is, then,

$\int_{\beta}^{1} f(v) \, dv = 1 - \alpha$

since $1 - \alpha$ is our assumed confidence coefficient, so that

(12.79) $N \beta^{N-1} - (N - 1) \beta^N = \alpha$

from which $N$ can be obtained. Thus, for $\alpha = 0.05$ and $\beta = 0.99$ we find
$N = 473$. The probability is 0.95 that at least 99% of a parent population
(continuously distributed) will lie between the least and greatest values found
in a sample of 473. These limits are independent of the form of the parent
distribution.
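
The sample size can be found by direct search. The short Python sketch below is ours (the function name is hypothetical); it looks for the smallest $N$ at which $N\beta^{N-1} - (N-1)\beta^N$, the probability that the sample range covers less than a fraction $\beta$ of the population, falls to $\alpha$ or below.

```python
def min_sample_size(alpha, beta):
    """Smallest N with N*beta**(N-1) - (N-1)*beta**N <= alpha."""
    n = 2
    while n * beta ** (n - 1) - (n - 1) * beta ** n > alpha:
        n += 1
    return n


n_required = min_sample_size(0.05, 0.99)
print(n_required)
```

For $\alpha = 0.05$ and $\beta = 0.99$ the search stops at $N = 473$, the value given in the text.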

A somewhat more difficult problem arises when we wish to take account of
the fact that the frequency function for the parent population depends on a
parameter $\theta$. This problem has been discussed by S. S. Wilks. If $f(x, \theta)$ is
the known frequency function for the parent population, then for any given
$\delta$ between 0 and 1, we can determine $\lambda_1$ and $\lambda_2$ as functions of $\delta$ and $\theta$ so that

(12.80) $\int_{-\infty}^{\lambda_1} f(x, \theta) \, dx = \int_{\lambda_2}^{\infty} f(x, \theta) \, dx = \dfrac{1 - \delta}{2}$

If $\hat{\theta}$ is the maximum likelihood estimate of $\theta$ from a sample of $N$ and if

(12.81) $v = \int_{\hat{\lambda}_1}^{\hat{\lambda}_2} f(x, \theta) \, dx$

where $\hat{\lambda}_1 = \lambda_1(\hat{\theta}, \delta)$ and similarly for $\hat{\lambda}_2$, then $v$ is the proportion of the parent
population lying between $\hat{\lambda}_1$ and $\hat{\lambda}_2$. Since $\hat{\theta}$ tends in the stochastic sense to $\theta$
as $N$ increases, $v$ tends stochastically to the value $\delta$.

Now let $\beta$ be a number between 0 and $\delta$. If the distribution of $v$ is independent of $\theta$, then, for $N$ sufficiently large, the probability that $v > \beta$ will be
greater than $1 - \alpha$, for any given $\alpha$ between 0 and 1, whatever the value of $\theta$.
Usually $\beta$ and $1 - \alpha$ will both be near 1, and there will be a smallest value of $N$
for which this probability is practically equal to $1 - \alpha$. The values of $\hat{\lambda}_1$ and
$\hat{\lambda}_2$, calculated for a sample of this size, are called $100\beta\%$ parameter-free
tolerance limits at the significance level $\alpha$.


Example 15. A sample of 20 is drawn from a normal population of mean $\mu$ and variance
$\sigma^2$. It is required to establish tolerance limits for $x$.

The parameters $\mu$ and $\sigma^2$ are unknown but may be estimated from the sample mean and
variance. If $\bar{x}$ and $s^2$ are the sample mean and the variance, it is natural to take the tolerance limits as $\bar{x} \pm ks$, where $k$ is a constant. $k$ is determined by the condition that if

(12.82) $v = \dfrac{1}{\sigma \sqrt{2\pi}} \int_{\bar{x} - ks}^{\bar{x} + ks} e^{-(x - \mu)^2 / 2\sigma^2} \, dx$

then the probability that $v > \beta$ has a specified value $1 - \alpha$. The exact distribution of $v$ is
very complicated. The expectation of $v$ can be found, using the joint distribution of $\bar{x}$ and
$s$ worked out in § 7.6, namely,

(12.83) $f(\bar{x}, s) = \dfrac{(N / 2\pi\sigma^2)^{1/2} \, 2 (N / 2\sigma^2)^{(N-1)/2}}{\Gamma[(N - 1)/2]} \, e^{-N(\bar{x} - \mu)^2 / 2\sigma^2} \, s^{N-2} e^{-N s^2 / 2\sigma^2}$

We have that

(12.84) $E(v) = \int_{-\infty}^{\infty} \int_{0}^{\infty} v \, f(\bar{x}, s) \, ds \, d\bar{x}$

where $v$ is given by (12.82). The integral may be simplified by means of suitable changes of
variable, and may be expressed as

(12.85) $E(v) = \dfrac{\Gamma(N/2)}{\sqrt{(N - 1)\pi} \; \Gamma[(N - 1)/2]} \int_{-k'}^{k'} \left( 1 + \dfrac{t^2}{N - 1} \right)^{-N/2} dt$

where $k' = k[(N - 1)/(N + 1)]^{1/2}$. Since the integrand in (12.85) is the frequency function of
Student's $t$, $E(v)$ is equal to $1 - P$, where $P$ is the probability in Fisher's table for a given $k'$.
The tolerance limits which include on the average 95% of the values of $x$ are given by
$k' = 2.093$ for $N = 20$ (19 degrees of freedom) so that $k = (21/19)^{1/2} k' = 2.200$. The limits
are, therefore, $\bar{x} \pm 2.200s$.
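
The relation between $k$ and the tabulated value of Student's $t$ can be verified numerically. The Python sketch below is an illustration with hypothetical function names: it integrates the Student frequency function with $N - 1$ degrees of freedom over $(-k', k')$ by Simpson's rule, taking $k' = k[(N-1)/(N+1)]^{1/2}$, and so evaluates the expected proportion $E(v)$ covered by $\bar{x} \pm ks$.

```python
import math


def student_density(t, dof):
    """Frequency function of Student's t with dof degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (
        math.sqrt(dof * math.pi) * math.gamma(dof / 2)
    )
    return c * (1 + t * t / dof) ** (-(dof + 1) / 2)


def expected_coverage(k, n, steps=2000):
    """E(v) for limits mean +/- k*s, with s**2 the sample variance (divisor n)."""
    dof = n - 1
    k_prime = k * math.sqrt((n - 1) / (n + 1))
    # Composite Simpson's rule on [-k', k']; steps must be even.
    h = 2 * k_prime / steps
    total = student_density(-k_prime, dof) + student_density(k_prime, dof)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * student_density(-k_prime + i * h, dof)
    return total * h / 3


k = 2.093 * math.sqrt(21 / 19)  # 2.093 is the tabulated t for 19 d.f.
coverage = expected_coverage(k, 20)
print(round(k, 3), round(coverage, 3))
```

With the tabulated value 2.093 the factor is $k = 2.200$ and the expected coverage is 0.95 to three decimal places.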

This, however, does not establish the extent of sampling fluctuation of these limits. The
variance of $v$ for large samples is given by

(12.86) $\sigma_v^2 = k'^2 e^{-k'^2} / (\pi N)$

as far as terms of order $1/N$.

As an approximation to the distribution of $v$ we can try fitting a Pearson Type I function,

(12.87) $f(v) = v^{a-1} (1 - v)^{b-1} / B(a, b)$

determining $a$ and $b$ by equating the mean and variance of this distribution to $1 - P$ and
$\sigma_v^2$ respectively. The values of $a$ and $b$ so found are

(12.88) $a = P(1 - P)^2 / \sigma_v^2 - (1 - P)$

$b = P^2 (1 - P) / \sigma_v^2 - P$

The probability that $v > \beta$ can then be read from Pearson's table of the Incomplete Beta
function. The distribution is so skew, however, that the value of $a$ may well turn out beyond the range of the table, and recourse must then be had to calculation of the integral by
quadrature.


In the example above, if $P = 0.05$ and $k' = 2.093$, we find that $\sigma_v^2 = 0.000873$,
$a = 50.76$, $b = 2.67$. Since the range of $v$ is from 0 to 1 and its expectation
is 0.95, it is evident that the distribution is very strongly skewed, with a long
tail to the left. The tables go as far as $a = 50$, for $b = 2.5$ and $b = 3.0$.
A rough interpolation and extrapolation gives a value of $\beta = 0.893$ corresponding to $\alpha = 0.05$. The limits calculated, namely, $\bar{x} \pm 2.200s$, are,
therefore, 89% parameter-free tolerance limits at the significance level 0.05.
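
The three constants quoted above can be recomputed as follows. This Python sketch is ours (hypothetical function name) and rests on one assumption: that the large-sample variance has the form $\sigma_v^2 = k'^2 e^{-k'^2}/(\pi N)$, which is the form that reproduces the quoted value 0.000873. The parameters $a$ and $b$ then follow from matching the mean $1 - P$ and the variance $\sigma_v^2$ of the Type I curve.

```python
import math


def beta_fit(P, k_prime, n):
    """Return (var_v, a, b) for the Pearson Type I approximation to v."""
    # Assumed large-sample variance of v (see the lead-in above).
    var_v = k_prime ** 2 * math.exp(-k_prime ** 2) / (math.pi * n)
    # Moment matching: mean 1 - P, variance var_v.
    a = P * (1 - P) ** 2 / var_v - (1 - P)
    b = P ** 2 * (1 - P) / var_v - P
    return var_v, a, b


var_v, a_fit, b_fit = beta_fit(0.05, 2.093, 20)
print(round(var_v, 6), round(a_fit, 2), round(b_fit, 2))
```

The fitted curve has mean $a/(a + b) = 1 - P$ by construction, so only the variance formula is an outside assumption here.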

Extensive tables of tolerance limits prepared by A. H. Bowker for a normal
distribution are given in Chapter II of Techniques of Statistical Analysis, by
Eisenhart, Hastay and Wallis (McGraw-Hill, 1947). These tables give the
factors $k$ for selected values of $\gamma$ and $P$ such that the probability is $\gamma$ that at
least a proportion $P$ of the distribution will be included between $\bar{x} \pm ks$. All
sample sizes from 2 to 200 and others at intervals between 200 and 1000 are
included. For $N = 20$, $\gamma = 0.95$, and $P = 0.90$, the value of $k$ is 2.286.
The 90% tolerance limits with confidence coefficient 0.95 are, therefore,
$\bar{x} \pm 2.286s$.

Problems


1. Prove that the simultaneous maximum likelihood estimates of the five parameters
$\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ of a bivariate normal distribution with frequency function

$f(x, y) = \dfrac{1}{2\pi\sigma_1\sigma_2 (1 - \rho^2)^{1/2}} \exp \left\{ -\dfrac{1}{2(1 - \rho^2)} \left[ \left( \dfrac{x - \mu_1}{\sigma_1} \right)^2 - \dfrac{2\rho (x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \left( \dfrac{y - \mu_2}{\sigma_2} \right)^2 \right] \right\}$

are the sample means, the sample variances and the sample covariance respectively.

2. Obtain the maximum likelihood estimates of the parameter $\sigma$ for fixed $\mu$ and $\lambda$ and of the
parameter $\lambda$ for fixed $\mu$ and $\sigma$, for the Type III curve of equation (12.25). (Use the likelihood
function given in (12.26).)

3. Find the efficiencies of the moment estimates corresponding to the estimates obtained
in Problem 2.

4. Prove that the maximum likelihood estimate of the mean of a parent population of
Poisson Type (with parameter $\mu$) is equal to the sample mean $\bar{x}$ and that the variance of this
estimate is $\mu/N$.

5. Using a table of random sampling numbers, write down 40 pairs of two-digit numbers.
Regard any pair, say 78, 16, as being an independent sample of $x_1$ and $x_2$ from a rectangular
distribution on the range 0 to $\theta$, as in Example 9, § 12.10. Calculate for each sample the two
sets of confidence limits (a) $\tfrac{2}{3}(x_1 + x_2)$ and $2(x_1 + x_2)$ and (b) $L$ and $2L$. Verify that for
both (a) and (b) about 3/4 of the confidence intervals include the true value $\theta = 100$.

6. Suppose that in a certain population the probability that an individual is "defective"
is either 0.1 or 0.3, but cannot have any other value, and that we wish to test the hypothesis
$H_0$ ($\pi = 0.1$) against the alternative hypothesis $H_1$ ($\pi = 0.3$). If $d_n$ is the number of defectives in a sample of $n$, and if we agree to accept $H_0$ if $d_n \le k$ and to accept $H_1$ otherwise,
show that in order to have the risks of error of the first and second kind about 0.02 and 0.03
respectively, we should take $n = 55$ and $k = 10$.

7. Construct a sequential acceptance-and-rejection chart, similar to that in Figure 44, for
the situation described in Problem 6. Take $p_0 = 0.1$ and $p_1 = 0.3$.

8. Perform an imaginary sampling experiment from the population of Problems 6 and 7
by reading a set of one-digit random numbers and regarding each number as a sample item.
For hypothesis $H_0$ take 0 as indicating a "defective"; for hypothesis $H_1$ take 0, 1 and 2 as
defectives. In 20 trials on each hypothesis count the number of samples necessary to reach
a decision, using the chart constructed in Problem 7.

Calculate the approximate expected value of this number on each hypothesis and compare
with the average number found in these trials.

9. Assuming that the proportion of defectives can vary in the population from 0 to 1, but
that the acceptance limits are $p_0 = 0.1$ and $p_1 = 0.3$, construct the operating characteristic
curve of the binomial sequential test with $\alpha = 0.02$, $\beta = 0.03$.

Table I. Ordinates and Areas of the Normal Curve, $\phi(t) = \dfrac{1}{\sqrt{2\pi}} e^{-t^2/2}$

   t      $\phi(t)$    $\int_0^t \phi(t)\,dt$

  .00     .39894     ...
  .01     ...        .00399
  .03     .39876     .01197
  .04     ...        .01595
  .05     ...        .01994
  .07     ...        .02790
  .08     ...        .03188
  .09     .39733     .03586
  .10     ...        .03983
  .11     ...        .04380
  .12     .39608     .04776
  .13     .39559     .05172
  .14     .39505     .05567
  .21     .39024     .08317
  .22     .38940     .08706
  .45     ...        .17364
  .46     .35889     ...
  .47     ...        .18082
  .48     .35553     .18439
  .52     ...        .19847
  .53     .34667     .20194
  .54     .34482     .20540
  .55     ...        .20884
  .56     ...        .21226
  .57     ...        .21566
  .58     ...        .21904
  .90     ...        .31594
  .92     .26129     .32121
  .93     .25888     .32381
  .96     ...        .33147
  .97     ...        .33398
  .98     .24681     .33646
 1.00     ...        .34134
 1.01     .23955     .34375
 1.02     ...        .34614
 1.03     .23471     ...
 1.11     ...        .36650

Table II.* 5% (Roman Type) and 1% (Bold Face Type) Points for the Distribution of F

Table III. Values of $\chi^2$ Corresponding to Given Probabilities

Degrees of           Probability of a deviation greater than $\chi^2$
freedom n      .01      .02      .05      .10      .20      .30      .50

    1        6.635    5.412    3.841    2.706    1.642    1.074     .455
    2        9.210    7.824    5.991    4.605    3.219    2.408    1.386
    3       11.341    9.837    7.815    6.251    4.642    3.665    2.366
    4       13.277   11.668    9.488    7.779    5.989    4.878    3.357
    5       15.086   13.388   11.070    9.236    7.289    6.064    4.351
    6       16.812   15.033   12.592   10.645    8.558    7.231    5.348
    7       18.475   16.622   14.067   12.017    9.803    8.383    6.346
    8       20.090   18.168   15.507   13.362   11.030    9.524    7.344
    9       21.666   19.679   16.919   14.684   12.242   10.656    8.343
   10       23.209   21.161   18.307   15.987   13.442   11.781    9.342
   11       24.725   22.618   19.675   17.275   14.631   12.899   10.341
   12       26.217   24.054   21.026   18.549   15.812   14.011   11.340
   13       27.688   25.472   22.362   19.812   16.985   15.119   12.340
   14       29.141   26.873   23.685   21.064   18.151   16.222   13.339
   15       30.578   28.259   24.996   22.307   19.311   17.322   14.339
   16       32.000   29.633   26.296   23.542   20.465   18.418   15.338
   17       33.409   30.995   27.587   24.769   21.615   19.511   16.338
   18       34.805   32.346   28.869   25.989   22.760   20.601   17.338
   19       36.191   33.687   30.144   27.204   23.900   21.689   18.338
   20       37.566   35.020   31.410   28.412   25.038   22.775   19.337
   21       38.932   36.343   32.671   29.615   26.171   23.858   20.337
   22       40.289   37.659   33.924   30.813   27.301   24.939   21.337
   23       41.638   38.968   35.172   32.007   28.429   26.018   22.337
   24       42.980   40.270   36.415   33.196   29.553   27.096   23.337
   25       44.314   41.566   37.652   34.382   30.675   28.172   24.337
   26       45.642   42.856   38.885   35.563   31.795   29.246   25.336
   27       46.963   44.140   40.113   36.741   32.912   30.319   26.336
   28       48.278   45.419   41.337   37.916   34.027   31.391   27.336
   29       49.588   46.693   42.557   39.087   35.139   32.461   28.336
   30       50.892   47.962   43.773   40.256   36.250   33.530   29.336

For larger values of $n$, the quantity $(2\chi^2)^{1/2} - (2n - 1)^{1/2}$ may be used as a
normal deviate with unit standard deviation.

* This table is reproduced from "Statistical Methods for Research Workers," with the
generous permission of the author, Professor R. A. Fisher, and the publishers, Messrs.
Oliver and Boyd.

Table III. Values of $\chi^2$ Corresponding to Given Probabilities (cont.)

Degrees of           Probability of a deviation greater than $\chi^2$
freedom n'     .70      .80      .90      .95       .98       .99

    1         .148    .0642    .0158   .00393   .000628   .000157
    2         .713    .446     .211    .103     .0404     .0201
    3        1.424   1.005     .584    .352     .185      .115
    4        2.195   1.649    1.064    .711     .429      .297
    5        3.000   2.343    1.610   1.145     .752      .554
    6        3.828   3.070    2.204   1.635    1.134      .872
    7        4.671   3.822    2.833   2.167    1.564     1.239
    8        5.527   4.594    3.490   2.733    2.032     1.646
    9        6.393   5.380    4.168   3.325    2.532     2.088
   10        7.267   6.179    4.865   3.940    3.059     2.558
   11        8.148   6.989    5.578   4.575    3.609     3.053
   12        9.034   7.807    6.304   5.226    4.178     3.571
   13        9.926   8.634    7.042   5.892    4.765     4.107
   14       10.821   9.467    7.790   6.571    5.368     4.660
   15       11.721  10.307    8.547   7.261    5.985     5.229
   16       12.624  11.152    9.312   7.962    6.614     5.812
   17       13.531  12.002   10.085   8.672    7.255     6.408
   18       14.440  12.857   10.865   9.390    7.906     7.015
   19       15.352  13.716   11.651  10.117    8.567     7.633
   20       16.266  14.578   12.443  10.851    9.237     8.260
   21       17.182  15.445   13.240  11.591    9.915     8.897
   22       18.101  16.314   14.041  12.338   10.600     9.542
   23       19.021  17.187   14.848  13.091   11.293    10.196
   24       19.943  18.062   15.659  13.848   11.992    10.856
   25       20.867  18.940   16.473  14.611   12.697    11.524
   26       21.792  19.820   17.292  15.379   13.409    12.198
   27       22.719  20.703   18.114  16.151   14.125    12.879
   28       23.647  21.588   18.939  16.928   14.847    13.565
   29       24.577  22.475   19.768  17.708   15.574    14.256
   30       25.508  23.364   20.599  18.493   16.306    14.953

For larger values of $n'$, the quantity $(2\chi^2)^{1/2} - (2n' - 1)^{1/2}$ may be used as
a normal deviate with unit standard deviation.

Table IV. Values of t Corresponding to Given Probabilities

Degrees of           Probability of a deviation greater than t
freedom n     .005      .01     .025      .05       .1      .15

    1       63.657   31.821   12.706    6.314    3.078    1.963
    2        9.925    6.965    4.303    2.920    1.886    1.386
    3        5.841    4.541    3.182    2.353    1.638    1.250
    4        4.604    3.747    2.776    2.132    1.533    1.190
    5        4.032    3.365    2.571    2.015    1.476    1.156
    6        3.707    3.143    2.447    1.943    1.440    1.134
    7        3.499    2.998    2.365    1.895    1.415    1.119
    8        3.355    2.896    2.306    1.860    1.397    1.108
    9        3.250    2.821    2.262    1.833    1.383    1.100
   10        3.169    2.764    2.228    1.812    1.372    1.093
   11        3.106    2.718    2.201    1.796    1.363    1.088
   12        3.055    2.681    2.179    1.782    1.356    1.083
   13        3.012    2.650    2.160    1.771    1.350    1.079
   14        2.977    2.624    2.145    1.761    1.345    1.076
   15        2.947    2.602    2.131    1.753    1.341    1.074
   16        2.921    2.583    2.120    1.746    1.337    1.071
   17        2.898    2.567    2.110    1.740    1.333    1.069
   18        2.878    2.552    2.101    1.734    1.330    1.067
   19        2.861    2.539    2.093    1.729    1.328    1.066
   20        2.845    2.528    2.086    1.725    1.325    1.064
   21        2.831    2.518    2.080    1.721    1.323    1.063
   22        2.819    2.508    2.074    1.717    1.321    1.061
   23        2.807    2.500    2.069    1.714    1.319    1.060
   24        2.797    2.492    2.064    1.711    1.318    1.059
   25        2.787    2.485    2.060    1.708    1.316    1.058
   26        2.779    2.479    2.056    1.706    1.315    1.058
   27        2.771    2.473    2.052    1.703    1.314    1.057
   28        2.763    2.467    2.048    1.701    1.313    1.056
   29        2.756    2.462    2.045    1.699    1.311    1.055
   30        2.750    2.457    2.042    1.697    1.310    1.055
   ∞         2.576    2.326    1.960    1.645    1.282    1.036

The probability of a deviation numerically greater than t is twice the
probability given at the head of the table.

* This table is reproduced from "Statistical Methods for Research Workers," with the
generous permission of the author, Professor R. A. Fisher, and the publishers, Messrs.
Oliver and Boyd.

Table IV. Values of t Corresponding to Given Probabilities (cont.)

Degrees of           Probability of a deviation greater than t
freedom n       .2      .25       .3      .35       .4      .45

    1        1.376    1.000     .727     .510     .325     .158
    2        1.061     .816     .617     .445     .289     .142
    3         .978     .765     .584     .424     .277     .137
    4         .941     .741     .569     .414     .271     .134
    5         .920     .727     .559     .408     .267     .132
    6         .906     .718     .553     .404     .265     .131
    7         .896     .711     .549     .402     .263     .130
    8         .889     .706     .546     .399     .262     .130
    9         .883     .703     .543     .398     .261     .129
   10         .879     .700     .542     .397     .260     .129
   11         .876     .697     .540     .396     .260     .129
   12         .873     .695     .539     .395     .259     .128
   13         .870     .694     .538     .394     .259     .128
   14         .868     .692     .537     .393     .258     .128
   15         .866     .691     .536     .393     .258     .128
   16         .865     .690     .535     .392     .258     .128
   17         .863     .689     .534     .392     .257     .128
   18         .862     .688     .534     .392     .257     .127
   19         .861     .688     .533     .391     .257     .127
   20         .860     .687     .533     .391     .257     .127
   21         .859     .686     .532     .391     .257     .127
   22         .858     .686     .532     .390     .256     .127
   23         .858     .685     .532     .390     .256     .127
   24         .857     .685     .531     .390     .256     .127
   25         .856     .684     .531     .390     .256     .127
   26         .856     .684     .531     .390     .256     .127
   27         .855     .684     .531     .389     .256     .127
   28         .855     .683     .530     .389     .256     .127
   29         .854     .683     .530     .389     .256     .127
   30         .854     .683     .530     .389     .256     .127
   ∞          .842     .674     .524     .385     .253     .126

The probability of a deviation numerically greater than t is twice the
probability given at the head of the table.

INDEX


Acceptance, region of, 388 
Addition theorem (probabilities), 8 
Additivity (analysis of variance), 247, 
251 

Adjoint, of matrix, 296 
Aitken, A. C., 83, 295, 307, 322 
Allan, F. E., 365 
Analysis of covariance, 274 
Analysis of variance, 238, Chap. IX 
assumptions, 249-252 
in regression, 325 
one-way classification, 238, 260 
power of test, 395 
three-way classification, 241, 262 
two-way classification, 246, 262 
Anderson, R. L., 329, 365 
Angular transformation, 253 
Association (between attributes), 227 
Asymptotic series, 59 
Asymptotically normal distribution, 132, 133
most powerful tests, 391

Attributes, sampling of, 38 
association of, 227 

Banachiewicz, T., 322 
Barnard, G. A., 230, 233, 237 
Bartlett, M. S., 179, 198, 253, 256, 287 
Bayes, T., 15, 385 
assumption, 15, 17 
rule (theorem), 14, 15, 17, 18, 21, 129, 
130, 258 

rule for future events, 18 
Behrens-Fisher test, 257, 259, 264, 385 
Bernoulli, J., 22

distribution (see Binomial distribu- 
tion) 

numbers, 81, 82 
theorem, 41, 85 
Bertrand, J., 14, 53 
Bessel, F. W., correction, 161 
formula, 186 
function, 363 
Beta function, 61, 62 
incomplete, 64, 224 
tables, 64, 66, 324 


Beta variate, 95, 96, 101, 104, 184 
Beta prime variate, 96, 97, 101 
Bienaymé-Tchebycheff inequality, 84, 85, 371

sharper form, 86 
Bilinear form, 294 
Binomial distribution, 22, Chap. II 
charts, 148, 159 
confidence limits, 147, 149 
graphical representation, 23 
mode, 32 

moment generating function, 73 
moments, 29, 30 
normal approximation, 33, 34 
special case of multinomial, 113 
sum of binomial variates, 79 
tables of, 23, 148 
transformation of, 252 
Binomial theorem, 6, 7 
Birge, R. T., 321, 322 
Bivariate normal distribution, 92 
Bliss, C. I., 118, 126 
Blocks, 242 
complete, 278, 281 
incomplete, 281, 282 
randomized, 278 
Bose, R. C., 278, 288, 362, 366 
Bowker, A. H., 404 
Brandt, A. E., 229 
Bross, I. J., 288 

Buffon, G. L. (Comte de), 14, 123

Camp, B. H., 51, 52, 159, 236 
method for tetrachoric r, 207, 236 
Carver, H. C., 143, 144, 159
Cauchy distribution, 28, 72, 73, 91, 101 
Central limit theorem, 41, 88, 89, 108 
Liapounoff condition, 90 
Lindeberg condition, 89 
with dependent variables, 90 
Chance (see Probability), 13, 84 
Characteristic function, 73, 88, 93 
derivatives, 88 
limit of sequence of, 75 
uniqueness theorem for, 76 
Chebyshev, see Tchebycheff 


Chi square distribution, 98, 99, 100

approximations, 99, 126 
Cochran's theorem, 100
contingency tables, 228, 229, 235 
criterion of goodness of fit, 109, 118, 
377 

cumulants, 98 
effect of pooling, 118 
Fisher's theorem, 100
limiting distribution, 114, 116, 378 
minimum, 378 
moments, 98 
non-central, 362, 396 
relation to variance, 167 
relation to z, 181 
tables of, 117, 188 
test of hypotheses, 117 
Church, A. E. R., 159 
Clopper, C. J., 148, 159 
Cochran, W. G., 100, 118, 126, 250,
260, 278, 287, 288, 406 
theorem, 100 
Collective, 4, 32 

Combination (of probabilities), 233 
Combinations, 5 

Comparative trial (2 X 2 table), 232 
Components of variance, 260, 263 
Condition equations, 318, 319 
Confidence belt, 130, 135 
coefficient, 132, 382 

Confidence interval, 130, 132, 381-387
asymptotically shortest, 391 
geometrical picture, 383 
most selective, 382 
shortest, 382, 385 
unbiased, 382 

Confidence limits, 132, 134, 382 
binomial distribution, 147, 149, 193, 
194 

correlation coefficient, 220, 223 
difference of means, 186 
difference of parameters, 149 
F-distribution, 183 
mean, 134, 184, 384 
Poisson distribution, 150, 193 
regression constants, 209, 314 
regression estimate, 211 
standard deviation, 488 
variance, 188, 263

variance ratio, 216 
x, given y, 212 


Confidence regions, 391 
Conformable matrices, 291, 292 
Confounding (in design), 282 
partial, 286 

Consistent statistic, 370, 371 
Contingency, tables, 227, 229 
coefficient of mean square, 228 
exact distribution for 2 X 2 tables, 
230 

Convergence, in probability, 4, 85 
mathematical, 3 
of improper integrals, 56 
stochastic, 4, 26, 85 
Convolution, 70, 90 
Correction for continuity, 230 
for inefficient estimate, 380 
Correlation, between errors, 251 
intra-class, 282 
multiple, 339, 345, 358 
normal, 202 
partial, 350 
rank, 224-227 
serial, 151, 157, 354 
tetrachoric, 205 
total, 340 

Correlation coefficient, 69, 92 
average from samples, 223 
between observed and estimated x, 345 

between statistics, 379 
bias, 223 

confidence limits, 220, 223 
distribution when ρ = 0, 215 
distribution when ρ ≠ 0, 217 
distribution-free test, 223 
estimate of, 221 
multiple, 348 
partial, 350, 353 
partial with k variates, 353 
Correlation ratio, 323, 324, 395 
Correlation surface, 202 
Correlograms, 354 
Cost of sampling, 159 
Covariance, 68 
analysis of, 274-278 
in linear regression, 309 
matrix of, 310 
of two linear functions, 126 
Cowden, D. J., 332, 364, 365 
Cox, G. M., 260, 278, 287 
Craig, A. T., 30, 54, 129, 162, 197 
Craig, C. C., 107, 126 
Cramer, G. (Cramer’s rule), 297, 298 
Cramér, H., 5, 21, 75, 76, 93, 118, 119, 
126, 405, 406 
Credibility, 5 

Croxton, F. E., and Cowden, D. J., 188, 
198, 287 
Cumulants, 77 
additive property, 78 
estimates by k-statistics, 109, 110 
Cumulant generating function, 77 
Cumulative frequency function (see 
Distribution function) 

Curtiss, J. H., 93, 287 
Curve fitting, 108 
chi square test, 118 
efficiency, 109 

by maximum likelihood, 376 
by moments, 109, 376 
polynomials, 326 
Czuber, E., 197 

Dantzig, G. B., 393, 406 
David, F. N., 220, 223, 236 
Davis, H. T., 336, 366, 405 
Decimal points (matrix calculation), 
304, 305 

Degrees of freedom, 98, 162, 167 
in analysis of variance, 263 
Delta process (for standard errors), 
141, 197 

De Lury, D. B., 320, 329, 366 
Deming, W. E., and Birge, R. T., 174, 
197 

De Moivre, A. (Laplace theorem), 36, 
39, 40 

Density (probability), 13 
Design of experiments, 278 
Determinants, 83, 295 
alien cofactors, 295 
cofactors, 295 
development of, 295 
functional (see Jacobian) 
minors, 295 
multiplication of, 296 
Diagonal, principal, 293 
Dice experiments, 42 
Dichotomy, 22, 205 
double, 233 

Differentiation under integral sign, 56, 
368, 402 


Digamma function, 376, 405 
Discrete variable, 24, 69 
Discriminant, perfect, 355, 359 
observed, 360 
theoretical, 358 

Discriminant function, 355-361 
Distance between means, 356 
generalized, 362 

Distribution, asymptotically normal, 
132 

bivariate normal, 92 
joint, 67, 69, 70, 167 
marginal, 68 
singular, 92 

Distribution-free test, 223, 250 
Distribution function, 24, 42, 69, 111 
for ungrouped data, 112 
Dixon, W. J., 159 
Doob, J. L., 21 
Doolittle, M. H., 298 
Duncan, D. B., 322 
Dwyer, P. S., 298, 322 

Edgeworth, F. Y., 108 
Efficient estimation, 109, 119 
statistic, 371, 379 
Eisenhart, C., 287, 404 
Elderton, W. P., 126 
Elimination, systematic, 269, 298, 306 
Equations, condition, 318 
linear, 316 

normal, 207, 290, 317 
observation, 316 
of constraint, 318 
Error function, 43 

law (see Normal distribution) 

Errors, conditional, 273 
correlation between, 251 
experimental, 210, 243 
of first kind, 118, 232, 387, 389 
of second kind, 387, 389, 394, 399, 
400 

Estimate, standard error of, 201, 208, 
209, 211 

Estimates, best, 171, 367 
efficient, 109 

inefficient (correction of), 380 
least squares, 172 

maximum likelihood, 172, 221, 261, 
311 

maximum probability, 172 





minimum variance, 290 
modal, 172 

most-efficient, 109, 119, 371, 373, 379 
of X, given 336 
regular, 368 

unbiased, 110, 171, 173, 221, 261, 290, 
311, 367 

Estimation, Chap. XII 
interval, 129 

of components of variance, 260 
of treatment effects, 265 
point, 129 

Events, compound, 7, 8 
exclusive, 8 
independent, 10, 11 
simple, 7, 8 

Excess (see Kurtosis) 

Expectation, 25, 26, 70, 79 
theorems, 71, 72 
Experimental design, 278 
Exponential regression, 332 
modified, 332, 334 
Extreme values, distribution of, 190 

Factorial, 5, 22, 23 
function, 56 
moments, 91 

moment generating function, 91 
Stirling approximation to, 49, 59 
F-distribution, 181-184 
significance levels of, 184 
Feller, W., 21, 32, 54, 90, 93 
Fermat, P. de, 13 
Fiducial inference, 260, 264, 385 
limits, 258, 259 
probability, 258n, 386 
Fieller, E. C., 366 

Finite parent population, 142, 149, 159 
Fisher, R. A. (references), 21, 93, 159, 
198, 236, 237, 287, 288, 322, 366, 
405, 406 

analysis of variance, 158 
approximation to χ², 99 
Bayes' theorem, 15, 17 
chi-square, 100, 117 
combination of probabilities, 233 
correlation coefficient, 217, 219, 220 
cumulants, 77 
definition of statistic, 127* 
efficiency of estimates, 376 
exact test, 231 


fiducial inference, 260, 264, 385 
g-statistics, 190 
harmonic components, 336 
inequality on variance, 367, 370 
intra-class correlation, 283 

kurtosis, 27, 190 

levels of significance, 40, 118, 136 
multiple correlation, 350 
normal law, 43 
null hypothesis, 136 
regression, 315 
skewness, 27, 190 
t-distribution, 175, 178 
theorem, 100 
variance, 165, 167 
variance ratio, 182 
z-distribution, 180 
z-transformation, 221, 253 
Fisher and Yates (tables), 43, 111, 159, 
176, 182, 193, 215, 236, 237, 259 
Latin squares, 280 
orthogonal polynomials, 329 
random numbers, 152 
Forsyth, A. R., 159 
Fortuyn, A. B. D., 237 
Fourier transform, 76, 93 
Freedom, degrees of, 98, 162, 167, 
263 

Frequency function, 23, 67, 215 
surface, 173 
test, 154 

Fry, T. C., 44, 45, 46, 119 
Function, Beta, 61, 64, 66 
characteristic, 72, 88, 93 
cumulative frequency, 24, 42, 69, 111 
factorial, 56 
frequency, 23, 67 
Gamma, 56, 62, 64, 117 
hypergeometric, 49, 219, 350 
linear, 93 

moment generating, 72, 93 
odd or even, 55 
orthogonal, 209 

Functional determinant (see Jacobian) 

Gaddum, J. H., 123 
Games, theory of, 32 
Gamma function, 56, 62, 64, 117 
variate, 94, 95, 101, 376 
Gap test of randomness, 154 
Gauss, C. F., 161, 298 
Gaussian distribution (see Normal distribution) 

Geary, R. C., 170, 197 
Generalized T²-test, 362 
Glover, J. W. (tables), 23, 42, 59, 108, 
332, 365 

Goldstine, H. H., 322 
Gompertz curve, 364 
Goodness of fit, 109, 377 
Gosset, W. S. (see Student), 160, 197 
Goulden, C. H., 281, 287, 288 
Gowen, J. G., 267 
Graeco-Latin square, 280 
Gram-Charlier, A series, 108 
system of curves, 101, 107 
Grouping (see Sheppard’s corrections) 
Griffith, B. A., 366 